Abstract

We apply Renormalization Group (RG) techniques to Classical Information Theory,
in the limit of large codeword size n. In particular, we apply RG techniques to
(1) noiseless coding (i.e., a coding used for compression) and (2) noisy coding (i.e., a
coding used for channel transmission). Shannon's "first" and "second" theorems refer
to (1) and (2), respectively. Our RG technique uses composition class (CC) ideas,
so we call our technique Composition Class Renormalization Group (CCRG). CC's
are often called "types", and their theory is referred to as the "Method of Types".
For (1) and (2), we find that the probability of error can be expressed as an
Error Function whose argument contains variables that obey renormalization group
equations. We describe a computer program called WimpyRG-C1.0 that implements
the ideas of this paper. C++ source code for WimpyRG-C1.0 is publicly available.

1 Introduction



Renormalization Group (RG) techniques [1] are a panoply of techniques that serve
to obtain asymptotic limits. RG techniques usually apply to a system with a very 
large number of degrees of freedom that is described by a partition function Z . Most 
RG techniques comprise an iterative step (i.e., a step which is performed repeatedly) 
consisting of a decimation followed by a rescaling. Decimation involves reducing the 
number of degrees of freedom. Rescaling involves rescaling the variables of Z so as 
to bring Z to the same form it had before the previous decimation. (Curiously, in 
Roman times, the word "decimate" meant to kill 1 out of every 10 prisoners. The 
modern meaning of the word is more like killing 9 out of every 10). 

In this paper, we apply RG techniques to Classical Information Theory [2][3]
in the limit of large codeword size n. In particular, we apply RG techniques to 
(1) noiseless coding (i.e., a coding used for compression) and (2) noisy coding (i.e., a 
coding used for channel transmission). Shannon's "first" and "second" theorems refer 
to (1) and (2), respectively. For (1), we consider the special case of Csiszar-Korner 
(CK) universal code. For (2), we consider the special case of random encoding and 
maximum-likelihood (ML) decoding. For these special cases of (1) and (2), we find 
that the probability of error can be expressed as an Error Function (reviewed in an
appendix) whose argument contains variables that obey RG equations.

Of course, there is no unique way of applying RG techniques to Classical 
Information Theory. The way shown in this paper is new, to our knowledge. Our RG 
technique uses composition class (CC) ideas, so we call our technique Composition 
Class Renormalization Group (CCRG). Often, CC's are called "types" instead of 
CC's, and their theory is referred to as the "Method of Types". 

We end this paper by describing the internal algorithms and typical input
and output of a computer program called WimpyRG-C1.0 that implements the ideas
of this paper. (The 1.0 is the version number. The C before the 1.0 stands for
"Classical", to distinguish this program from a Q (Quantum) version of WimpyRG
that we expect to deliver in the future.) C++ source code for WimpyRG-C1.0 is
publicly available at www.ar-tiste.com/WimpyRG.html .

This paper straddles two fields (RG and Classical Information Theory) which 
are seldom used together within previous literature. It is therefore most likely that the 
reader is not closely acquainted with both of these fields. To help readers acquainted 
with only one of these two fields, the author has strived to make this paper as self- 
contained as reasonably possible. 

Before embarking on long, complicated calculations, let us discuss a simple 
example that illustrates the manner in which we will apply RG ideas to Information 
Theory in this paper. 

We show in this paper that the probability of error for both noiseless and noisy
coding can be expressed as an integral of the following type:

$$I = \int_{\xi}^{+\infty} dx\, e^{-n f(x)} , \tag{1}$$

where $n \gg 1$. Suppose $f : Reals \rightarrow Reals$ is a convex (i.e., shaped like a cup $\cup$)
function with a minimum at $x_0$. Let $\Delta x = x - x_0$, $\Delta\xi = \xi - x_0$, and $F(\Delta x) = f(x)$.
Then $I$ can be rewritten as

$$I = \int_{\Delta\xi}^{+\infty} d\Delta x\, e^{-nF(\Delta x)} . \tag{2}$$

$I$ can be approximated as follows:

$$I \approx e^{-nF(\Delta\xi_0)} , \tag{3}$$

where $\Delta\xi_0$ denotes the point of the integration range $[\Delta\xi, +\infty)$ at which $F$ is smallest.
This approximation for $I$ is the leading term of an asymptotic expansion. This method
of obtaining asymptotic expansions of integrals is usually called Laplace's Method [3],
named after the inventor of the closely related Laplace Transform. Unfortunately,
the $I$-approximation given by Eq. (3) is poor for those $\Delta\xi$ for which $F(\Delta\xi_0) = 0$.
Indeed, $e^{-nF(\Delta\xi_0)}$ is indeterminate because $nF(\Delta\xi_0) = \infty \cdot 0$. Our goal is to devise
an $I$-approximation that overcomes this limitation.

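Suppose, for example, that $F$ is quadratic in $\Delta x$:

$$F(\Delta x) = \frac{a}{2}(\Delta x)^2 , \tag{4}$$

for some $a > 0$. Then we can do the integration in Eq. (2) exactly in terms of Error
Functions (reviewed in an appendix):

$$I = \int_{\Delta\xi}^{+\infty} d\Delta x\, e^{-n\frac{a}{2}(\Delta x)^2} \tag{5a}$$

$$= \sqrt{\frac{\pi}{2na}}\; {\rm erfc}\left(\Delta\xi\sqrt{\frac{na}{2}}\right) . \tag{5b}$$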
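The closed form for the quadratic case can be checked numerically. The following is a minimal sketch, not part of WimpyRG: it assumes the quadratic $F(\Delta x) = \frac{a}{2}(\Delta x)^2$ and compares a midpoint-rule estimate of the tail integral with the expression $\sqrt{\pi/(2na)}\,{\rm erfc}(\Delta\xi\sqrt{na/2})$. The function names and the truncation of the upper limit at 12 standard deviations are illustrative choices.

```python
import math

def tail_integral_numeric(delta_xi, n, a, steps=50_000):
    """Midpoint-rule estimate of the integral of exp(-n*a*x^2/2) over [delta_xi, +inf)."""
    sigma = 1.0 / math.sqrt(n * a)
    upper = max(delta_xi, 0.0) + 12.0 * sigma  # integrand is negligible beyond this point
    h = (upper - delta_xi) / steps
    total = 0.0
    for k in range(steps):
        x = delta_xi + (k + 0.5) * h
        total += math.exp(-0.5 * n * a * x * x)
    return total * h

def tail_integral_closed_form(delta_xi, n, a):
    """Same integral written with the complementary Error Function."""
    return math.sqrt(math.pi / (2.0 * n * a)) * math.erfc(delta_xi * math.sqrt(n * a / 2.0))

if __name__ == "__main__":
    for dxi in (-0.05, 0.0, 0.1):
        print(dxi, tail_integral_numeric(dxi, 100, 2.0), tail_integral_closed_form(dxi, 100, 2.0))
```

Both positive and negative $\Delta\xi$ are handled, which is the regime where the simple Laplace estimate breaks down.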

Using RG ideas, we can generalize this result, valid only for a quadratic $F$, to more
general types of $F$. In Eq. (2), let us rescale the parameters $\Delta\xi$, $n$ and the integration
variable $\Delta x$, but keep the value of $I$ fixed. Then

$$I = \int_{\Delta\xi^{\Lambda}}^{+\infty} d\Delta x^{\Lambda}\, J\, e^{-n^{\Lambda}F^{\Lambda}(\Delta x)} , \tag{6}$$

where $J$ is a Jacobian, and where, for some parameter $s > 0$, we define

$$n^{\Lambda} = e^{s} n , \tag{7}$$

and

$$F^{\Lambda}(\Delta x) = F(\Delta x^{\Lambda}) = e^{-s}F(\Delta x) . \tag{8}$$

For $s = \delta s$ where $0 < \delta s \ll 1$, we get:

$$\delta\Delta\xi = -\delta s\, \frac{F(\Delta\xi)}{F^1(\Delta\xi)} . \tag{9}$$

From Eq. (9), we get the following "RG Equation":

$$\frac{d\Delta\xi^{(s)}}{ds} = -\frac{F(\Delta\xi^{(s)})}{F^1(\Delta\xi^{(s)})} , \tag{10}$$

where $F^n$ is the nth derivative of $F$, and we have replaced the symbol $\Lambda$ by $(s)$. Of
course, this RG equation is trivial and can be solved immediately:

$$\Delta\xi^{(s)} = F^{-1}\left(e^{-s}F(\Delta\xi)\right) . \tag{11}$$

In the more complicated examples presented later in this paper, one gets a system of
coupled RG equations with complicated boundary conditions. Such systems of RG
equations usually cannot be solved exactly, but they can be solved numerically with
a computer.
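To illustrate what "solved numerically" means for the toy RG equation, here is a small sketch. It assumes a hypothetical convex function $F(x) = \cosh x - 1$ (chosen only because its inverse is known in closed form), integrates $d\Delta\xi/ds = -F/F^1$ with a fourth-order Runge-Kutta scheme, and compares the result with $F^{-1}(e^{-s}F(\Delta\xi))$. None of these names come from WimpyRG.

```python
import math

def F(x):
    """Hypothetical convex F with minimum F(0) = 0."""
    return math.cosh(x) - 1.0

def F1(x):
    """First derivative of F."""
    return math.sinh(x)

def rg_flow(xi0, s_final, steps=2000):
    """Integrate d(xi)/ds = -F(xi)/F1(xi) from s = 0 to s_final with RK4 (xi0 > 0 assumed)."""
    rhs = lambda x: -F(x) / F1(x)
    h = s_final / steps
    x = xi0
    for _ in range(steps):
        k1 = rhs(x)
        k2 = rhs(x + 0.5 * h * k1)
        k3 = rhs(x + 0.5 * h * k2)
        k4 = rhs(x + h * k3)
        x += (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return x

def rg_closed_form(xi0, s):
    """Closed-form solution F^{-1}(exp(-s) F(xi0)) for xi0 > 0."""
    return math.acosh(1.0 + math.exp(-s) * F(xi0))

if __name__ == "__main__":
    print(rg_flow(1.5, 3.0), rg_closed_form(1.5, 3.0))
```

The flow drives the parameter toward zero while keeping the product $e^{s} F(\Delta\xi^{(s)})$ fixed, which is the invariance the rescaling is designed to preserve.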

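We can calculate the Jacobian $J$ as follows:

$$\Delta x^{(\delta s)} = \Delta x + \delta\Delta x = \Delta x - \delta s\,\frac{F}{F^1} , \tag{12}$$

so

$$J^{-1} = \frac{d\Delta x^{(\delta s)}}{d\Delta x} = 1 - \delta s\left[1 - \frac{F F^2}{(F^1)^2}\right] . \tag{13}$$

Note that we are justified in setting $J \approx 1$ if we are only interested in finding $I$ to
leading order in $n$.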

Suppose $\Delta\xi > 0$. Since $F(\Delta\xi)$ is a convex function with minimum at the
origin, as $s$ increases (and therefore also $n^{(s)}$ increases), then, according to Eq. (10), $\Delta\xi^{(s)}$
decreases. Likewise, if $\Delta\xi < 0$, then as $s$ increases, $\Delta\xi^{(s)}$ increases. In both cases, $\Delta\xi^{(s)}$
is attracted to zero as $s$ increases. By making $s$ large enough, we can make $\Delta\xi^{(s)}$ small
enough so that $F$ is well approximated by its quadratic approximation:

$$I = \int_{\Delta\xi^{(s)}}^{+\infty} d\Delta x^{(s)}\, J\, e^{-n^{(s)}F(\Delta x^{(s)})} \tag{14a}$$

$$\approx e^{-n^{(s)}F(0)} \int_{\Delta\xi^{(s)}}^{+\infty} d\Delta x^{(s)}\, e^{-n^{(s)}\frac{F^2(0)}{2}(\Delta x^{(s)})^2} \tag{14b}$$

$$= e^{-n^{(s)}F(0)} \sqrt{\frac{\pi}{2n^{(s)}F^2(0)}}\; {\rm erfc}\left(\Delta\xi^{(s)}\sqrt{\frac{n^{(s)}F^2(0)}{2}}\right) . \tag{14c}$$

In Eq. (14), to go from line (a) to (b), we replaced $F$ by its Taylor expansion up to
second order (this is valid for very large $s$) and we approximated $J$ by one (this is
valid to leading order in $n$). Eq. (14c) is typical of the type of approximations that
we propose in this paper.

Before leaving our toy example, it is instructive to compare the $I$-approximation
Eq. (14c) to the exact answer in case $F$ is quadratic. So assume $F(0) = 0$ and
$F^2(0) = a$ as in Eq. (4). For such an $F$, one can show from Eq. (11) that

$$\Delta\xi^{(s)} = e^{-\frac{s}{2}}\Delta\xi . \tag{15}$$

Furthermore, one can show from Eq. (13) that

$$J^{-1} = e^{-\frac{s}{2}} . \tag{16}$$

By definition,

$$n^{(s)} = e^{s} n . \tag{17}$$

Thus,

$$I = \int_{\Delta\xi^{(s)}}^{+\infty} d\Delta x^{(s)}\, J\, e^{-n^{(s)}F(\Delta x^{(s)})} \tag{18a}$$

$$= e^{\frac{s}{2}} \sqrt{\frac{\pi}{2n^{(s)}a}}\; {\rm erfc}\left(\Delta\xi^{(s)}\sqrt{\frac{n^{(s)}a}{2}}\right) . \tag{18b}$$

Hence, we see that for a quadratic $F$, the $I$-approximation Eq. (14c) differs from the exact
answer by a factor of $e^{\frac{s}{2}}$. This discrepancy is due to the fact that we neglected the
Jacobian in deriving the $I$-approximation Eq. (14c).
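The size of the Jacobian discrepancy in the quadratic toy model can be confirmed with a few lines of arithmetic. The sketch below is illustrative only: it assumes $F(\Delta x) = \frac{a}{2}(\Delta x)^2$, the rescalings $n^{(s)} = e^{s}n$ and $\Delta\xi^{(s)} = e^{-s/2}\Delta\xi$, and it evaluates the Error Function expression once with the original parameters and once with the rescaled parameters and $J$ set to one.

```python
import math

def exact_integral(delta_xi, n, a):
    """Exact tail integral for quadratic F, in terms of erfc."""
    return math.sqrt(math.pi / (2.0 * n * a)) * math.erfc(delta_xi * math.sqrt(n * a / 2.0))

def rescaled_integral_without_jacobian(delta_xi, n, a, s):
    """Same expression evaluated at the rescaled (n, delta_xi), with J approximated by 1."""
    n_s = math.exp(s) * n
    xi_s = math.exp(-0.5 * s) * delta_xi
    return math.sqrt(math.pi / (2.0 * n_s * a)) * math.erfc(xi_s * math.sqrt(n_s * a / 2.0))

if __name__ == "__main__":
    s = 2.0
    ratio = exact_integral(0.1, 50, 3.0) / rescaled_integral_without_jacobian(0.1, 50, 3.0, s)
    print(ratio, math.exp(0.5 * s))
```

The argument of erfc is unchanged by the rescaling, so the whole discrepancy sits in the prefactor.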

2 Notation



In this section, we will present some notation that will be used throughout the paper. 

RHS and LHS will stand for "right hand side" and "left hand side", respectively.
When we say "x (ditto, y) is A (ditto, B)" we will mean that x is A and y is B.

The number of elements in a set $S$ will be denoted by $|S|$. Let $Z_{a,b} = \{a, a+1, a+2, \ldots, b\}$
for any integers $a \leq b$. Let $x^{*n}$ represent an n-tuple consisting of
n copies of $x$. For example, $x^{*3} = (x, x, x)$. Any of the following notations will
be used to denote a set with indexed elements $A_i$, where $i \in S_i$:
$\{A_i\}_{\forall i} = \{A_i : \forall i\} = \{A_i : \forall i \in S_i\}$. Any of the following notations will be used to denote an
ordered set (or vector) with components $A_i$: $\vec{A} = (A_i)_{\forall i} = (A_i : \forall i)$. For example,
we might refer to a matrix with elements $A_{ij}$ by $[A_{ij}]_{\forall (i,j)}$. The components of a
vector $\vec{A}$ will be denoted by $A_j = (A_i : \forall i)_j$. For any function $f : S_{\underline{x}} \rightarrow Reals$, let
$\prod\{f(x)\}_{\forall x} = \prod_{x \in S_{\underline{x}}} f(x)$. We will sometimes abbreviate $\prod\{f(x)\}_{\forall x}$ by $\prod\{f\}$. If
$\vec{x} \in S_{\underline{x}}^n$ represents an n-letter codeword, we reserve the upper index location for the
label of a letter in the codeword. Thus, we will denote the codeword $\vec{x}$ also by $x^{\cdot n}$,
and its i'th component by $(\vec{x})^i = x^i \in S_{\underline{x}}$ for all $i \in Z_{1,n}$. $Reals^{n \times m}$ will represent
n by m matrices with real entries.

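Given two sequences of real numbers $(a_n)_{\forall n}$ and $(b_n)_{\forall n}$ where $n \in Z_{1,\infty}$, we
will often write $a_n \sim b_n$ to mean that $\lim_{n \rightarrow \infty} \frac{a_n}{b_n} = 1$.

$pd(S)$ will represent all probability distributions on $S$; that is, all functions
$P : S \rightarrow [0, 1]$ such that $\sum_{x \in S} P(x) = 1$. Random variables will be denoted by
underlining. The set of all possible values that a random variable $\underline{x}$ can assume will
be denoted by $S_{\underline{x}}$. Let $|S_{\underline{x}}| = N_{\underline{x}}$. For any $x \in S_{\underline{x}}$, the probability $P(\underline{x} = x) = P_{\underline{x}}(x)$
often will be abbreviated by $P(x)$ if this will not lead to confusion. Likewise, for two
random variables $\underline{x}, \underline{y}$, $S_{\underline{x},\underline{y}} = S_{\underline{x}} \times S_{\underline{y}} = \{(x, y) : x \in S_{\underline{x}}, y \in S_{\underline{y}}\}$ and
$N_{\underline{x},\underline{y}} = N_{\underline{x}} N_{\underline{y}}$. The probability $P(\underline{x} = x, \underline{y} = y) = P_{\underline{x},\underline{y}}(x, y)$ often will be abbreviated by
$P(x, y)$ if this will not lead to confusion.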

For any statement $\mathcal{S}$, let $\theta(\mathcal{S})$ denote the "truth function" or "indicator function":
it equals 1 if $\mathcal{S}$ is true and it equals 0 if $\mathcal{S}$ is false. For example, $\theta(x > 0)$ is the
unit step function. The Kronecker delta function is defined as $\delta(x, y) = \delta_x^y = \theta(x = y)$.
Its continuum version, the Dirac delta function, is defined by

(19)

for some infinitesimal $\epsilon > 0$. The Dirac function $\delta(x)$ has unit area: $\int_{-\infty}^{+\infty} dx\, \delta(x) = 1$,
and is sharply peaked at $x = 0$. The identity $\delta(x) = \frac{d}{dx}\theta(x > 0)$ is easily proven using
the sharply peaked and unit area properties of $\delta(x)$. This identity connecting the
Dirac delta function and the unit step function leads us to suspect that there is an
integral representation for the unit step function, analogous to Eq. (19) for the Dirac
delta function. Indeed, there is. Suppose $K > 0$. The following equation is easy to
prove using contour integration in the complex plane:

$$\theta(x > 0) = \frac{1}{2\pi i}\int_{K - i\infty}^{K + i\infty} dk\, \frac{e^{kx}}{k} . \tag{20}$$



See Fig. 1. For $x > 0$, the integration contour can be deformed so that it wraps around
the point $k = 0$. By integrating around this pole, it is easy to show that for $x > 0$,
the RHS of Eq. (20) equals 1. For $x < 0$, the integration contour can be deformed so
that it wraps around the point $k = +\infty$. Thus, for $x < 0$, the RHS of Eq. (20) equals
0.

Figure 1: For the complex integral Eq. (20), one can deform the contour of integration
differently for $x < 0$ and $x > 0$.



The Shannon entropy associated with the random variable $\underline{x}$ will be represented
by any of the following:

$$H_P(\underline{x}) = H(P_{\underline{x}}) = H(P) = H(P(x))_{\forall x} = -\sum_x P(x) \ln P(x) . \tag{21}$$

Likewise, the relative entropy (also called the Kullback-Leibler distance) between two
probability distributions $P(x)$ and $Q(x)$ will be represented by any of the following:

$$D(P_{\underline{x}}//Q_{\underline{x}}) = D(P//Q) = D(P(x)//Q(x))_{\forall x} = \sum_x P(x) \ln \frac{P(x)}{Q(x)} . \tag{22}$$

We will also use the conditional entropy

$$H(\underline{x}|\underline{y}) = -\sum_{x,y} P(x,y) \ln P(x|y) , \tag{23}$$

and the mutual entropy:

$$H(\underline{x}:\underline{y}) = \sum_{x,y} P(x,y) \ln \frac{P(x,y)}{P(x)P(y)} . \tag{24}$$

Note that we have defined our entropies in terms of base e rather than base 2 logs.
Of course, $\log_2 x = \frac{\ln x}{\ln 2}$.
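As a concrete companion to these definitions, the following sketch evaluates the four quantities in nats for small probability tables. The function names and the list-of-lists layout `Pxy[x][y]` are assumptions made for illustration; terms with zero probability are skipped, following the usual convention $0 \ln 0 = 0$.

```python
import math

def entropy(P):
    """Shannon entropy H(P) in nats."""
    return -sum(p * math.log(p) for p in P if p > 0)

def relative_entropy(P, Q):
    """Relative entropy D(P//Q) in nats (Q must be positive wherever P is)."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

def conditional_entropy(Pxy):
    """H(x|y) for a joint table Pxy[x][y]."""
    Py = [sum(col) for col in zip(*Pxy)]
    return -sum(p * math.log(p / Py[j])
                for row in Pxy for j, p in enumerate(row) if p > 0)

def mutual_entropy(Pxy):
    """H(x:y) for a joint table Pxy[x][y]."""
    Px = [sum(row) for row in Pxy]
    Py = [sum(col) for col in zip(*Pxy)]
    return sum(p * math.log(p / (Px[i] * Py[j]))
               for i, row in enumerate(Pxy) for j, p in enumerate(row) if p > 0)

if __name__ == "__main__":
    Pxy = [[0.3, 0.1], [0.2, 0.4]]
    print(entropy([0.4, 0.6]), conditional_entropy(Pxy), mutual_entropy(Pxy))
```

A quick consistency check is the identity $H(\underline{x}:\underline{y}) = H(\underline{x}) - H(\underline{x}|\underline{y})$, which these functions satisfy up to rounding.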

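Let $\mathcal{D}P = \prod\{dP(x)\}_{\forall x}$. For any function $f : Reals^{N_{\underline{x}}} \rightarrow Reals$, define

$$\int \mathcal{D}P\, f(P) = \prod_x \left\{ \int dP(x) \right\} f(P) , \tag{25}$$

and

$$\int_{pd(S_{\underline{x}})} \mathcal{D}P\, f(P) = \int \mathcal{D}P\; \theta(P \geq 0)\, \delta\Big(\sum_x P(x) - 1\Big) f(P) . \tag{26}$$

It is easy to prove by induction that

$$\int_{pd(S_{\underline{x}})} \mathcal{D}P\; 1 = \frac{1}{(N_{\underline{x}} - 1)!} . \tag{27}$$
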

3 Composition Classes 

In this section we will discuss composition classes (CC's). Often, CC's are called
"types" instead of CC's, and their theory is referred to as the "Method of Types".
The term "type" is very vague, so we will shun it, and use the more specific term CC.
This section reviews and extends standard material on CC's as found in, for example,
the books by Cover and Thomas [2] and the one by Blahut.

In the mathematical theory of Statistics, one often considers a sequence of
n random variables $(\underline{x}^1, \underline{x}^2, \ldots, \underline{x}^n) = \underline{x}^{\cdot n} \in S_{\underline{x}}^n$. Information Theory also deals
with such sequences, where they are called a word (or codeword or block) of letters
(or symbols) $x$ from the alphabet $S_{\underline{x}}$. We will assume the simplest case, wherein
the n random variables are independent, identically distributed (i.i.d.), and each $\underline{x}^i$ is
distributed ("drawn") according to a probability distribution $Q : S_{\underline{x}} \rightarrow [0, 1]$. In what
follows, we will often refer to $Q$ as the Center of Mass (CM) probability distribution.
(The reason for this name will be explained later.)

Let $n(x|\vec{x})$ represent the number of times that the letter $x$ occurs in the word $\vec{x}$.
A composition class $C(\vec{x})$ (also called a "type" or "empirical distribution" or "relative
frequency") is defined by

$$C(\vec{x}) = \{\vec{y} : \forall x \in S_{\underline{x}},\; n(x|\vec{x}) = n(x|\vec{y})\} . \tag{28}$$

Clearly, this defines an equivalence relation on (and a disjoint partition for) the set $S_{\underline{x}}^n$.
To each CC, there corresponds a probability distribution given by

$$P_{C(\vec{x})}(x) = \frac{n(x|\vec{x})}{n} \tag{29}$$

for all $x \in S_{\underline{x}}$. In the notation $C(\vec{x})$, the CC is specified by giving one of its elements
$\vec{x}$. Alternatively, one can specify a CC by giving its probability distribution:

$$C(P) = \{\vec{x} : P_{C(\vec{x})} = P\} . \tag{30}$$

Hence $C(P_{C(\vec{y})}) = C(\vec{y})$.
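The definitions of letter counts, composition class and its probability distribution translate directly into code. This is a minimal sketch for very short words, using a hypothetical two-letter alphabet; the class is built by brute force as the set of distinct rearrangements of the word, which is only practical for tiny n.

```python
from itertools import permutations

def letter_counts(word, alphabet):
    """n(x|word) for every letter x of the alphabet, in alphabet order."""
    return tuple(word.count(x) for x in alphabet)

def cc_distribution(word, alphabet):
    """Probability distribution P_C(word) attached to the composition class of the word."""
    n = len(word)
    return tuple(c / n for c in letter_counts(word, alphabet))

def composition_class(word):
    """All words with the same letter counts as `word` (brute force)."""
    return {"".join(p) for p in permutations(word)}

if __name__ == "__main__":
    print(letter_counts("aab", "ab"), cc_distribution("aab", "ab"), sorted(composition_class("aab")))
```

Two words belong to the same class exactly when `letter_counts` returns the same tuple for both, which is the equivalence relation mentioned above.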

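Define $\binom{S_{\underline{a}}^n}{S_{\underline{b}}^n}$ to be the set of all $2 \times n$ matrices $\binom{\vec{a}}{\vec{b}}$, where $\vec{a} \in S_{\underline{a}}^n$ and
$\vec{b} \in S_{\underline{b}}^n$ are n-dimensional row vectors. For some $x^{\cdot n} \in \binom{S_{\underline{a}}^n}{S_{\underline{b}}^n}$, the CC denoted by
$C(x^{\cdot n}) = C\binom{\vec{a}}{\vec{b}}$ is defined as before, as the set of all $2 \times n$ matrices $y^{\cdot n} \in \binom{S_{\underline{a}}^n}{S_{\underline{b}}^n}$
such that, for all column vectors $x = \binom{a}{b}$ with $a \in S_{\underline{a}}$ and $b \in S_{\underline{b}}$, one has
$n(x|y^{\cdot n}) = n(x|x^{\cdot n})$.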

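For any $A \subset S_{\underline{x}}^n$, it is convenient to define the following two sets:

$$C(A) = \{C(\vec{x}) : \forall \vec{x} \in A\} , \tag{31}$$

$$\mathcal{P}(A) = \{P_{C(\vec{x})} : \forall \vec{x} \in A\} \subset pd(S_{\underline{x}}) . \tag{32}$$

Note that these two sets are in 1-1 correspondence. For $A = S_{\underline{x}}^n$, they become $C(S_{\underline{x}}^n)$
and $\mathcal{P}(S_{\underline{x}}^n)$.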

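For large n, we can easily estimate the number of elements in a CC and the
number of $C(\vec{x})$ for all $\vec{x} \in S_{\underline{x}}^n$.

Claim 3.1 As $n \rightarrow \infty$,

$$|C(\vec{x})| \approx \frac{\exp[nH(P_{C(\vec{x})})]}{\sqrt{(2\pi n)^{N_{\underline{x}}-1} \prod\{P_{C(\vec{x})}\}}} , \tag{33}$$

and

$$|C(S_{\underline{x}}^n)| \approx \frac{n^{N_{\underline{x}}-1}}{(N_{\underline{x}}-1)!} . \tag{34}$$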
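proof:

The exact number of elements in $C(\vec{x})$ is given by

$$|C(\vec{x})| = \frac{n!}{\prod_x \{n(x|\vec{x})!\}} . \tag{35}$$

Recall the first term of Stirling's asymptotic expansion, for large n, of the factorial:

$$n! \approx \sqrt{2\pi n}\; e^{-n} n^n . \tag{36}$$

Applying this approximation to the factorials in Eq. (35) immediately yields Eq. (33).
It is also known that $|C(\vec{x})|$ is bounded below and above as follows:

$$\frac{\exp[nH(P_{C(\vec{x})})]}{(n+1)^{N_{\underline{x}}}} \leq |C(\vec{x})| \leq \exp[nH(P_{C(\vec{x})})] . \tag{37}$$

Since $P_{C(\vec{x})} = \frac{1}{n}(n_1, n_2, \ldots, n_{N_{\underline{x}}})$, where $n_1, n_2, \ldots, n_{N_{\underline{x}}} \in Z_{0,n}$, it follows that
the exact number of CC's in $pd(S_{\underline{x}})$ is given by

$$|C(S_{\underline{x}}^n)| = \sum_{n_1=0}^{n} \sum_{n_2=0}^{n} \cdots \sum_{n_{N_{\underline{x}}}=0}^{n} \delta\Big(\sum_{j=1}^{N_{\underline{x}}} n_j,\; n\Big) . \tag{38}$$

The previous equation immediately implies that

$$|C(S_{\underline{x}}^n)| \leq (n+1)^{N_{\underline{x}}} . \tag{39}$$

Suppose $f : Reals \rightarrow Reals$. For large n:

$$\sum_{k=0}^{n} f(k) \approx \int_0^{n+1} dk\, f(k) . \tag{40}$$

For any n:

$$\sum_{k=0}^{n} \delta(k, k_0) = \theta(0 \leq k_0 \leq n) = \int_0^{n+1} dk\, \delta(k - k_0) . \tag{41}$$

We can use the previous two equations to approximate all sums in Eq. (38) by integrals.
This yields:

$$|C(S_{\underline{x}}^n)| \approx \int_0^{n+1} dn_1 \int_0^{n+1} dn_2 \cdots \int_0^{n+1} dn_{N_{\underline{x}}}\; \delta\Big(\sum_{j=1}^{N_{\underline{x}}} n_j - n\Big) \tag{42a}$$

$$\approx n^{N_{\underline{x}}-1} \int_0^1 dP_1 \int_0^1 dP_2 \cdots \int_0^1 dP_{N_{\underline{x}}}\; \delta\Big(\sum_{j=1}^{N_{\underline{x}}} P_j - 1\Big) \tag{42b}$$

$$= \frac{n^{N_{\underline{x}}-1}}{(N_{\underline{x}}-1)!} . \tag{42c}$$

QED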
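The two estimates of this claim are easy to test against exact counts. The sketch below assumes the Stirling-based estimate in the form $\exp[nH(P)]/\sqrt{(2\pi n)^{N-1}\prod\{P\}}$ with all letter counts positive, and uses the standard "stars and bars" count $\binom{n+N-1}{N-1}$ for the exact number of CC's; the function names are illustrative.

```python
import math

def cc_size_exact(counts):
    """Multinomial coefficient n! / prod(n_x!)."""
    size = math.factorial(sum(counts))
    for c in counts:
        size //= math.factorial(c)   # each step divides exactly
    return size

def cc_size_stirling(counts):
    """Estimate exp(nH) / sqrt((2 pi n)^(N-1) prod P), computed via logarithms."""
    n, N = sum(counts), len(counts)
    P = [c / n for c in counts]
    H = -sum(p * math.log(p) for p in P)
    log_size = n * H - 0.5 * ((N - 1) * math.log(2 * math.pi * n) + sum(math.log(p) for p in P))
    return math.exp(log_size)

def number_of_ccs_exact(n, N):
    """Number of ways to write n as an ordered sum of N non-negative integers."""
    return math.comb(n + N - 1, N - 1)

def number_of_ccs_estimate(n, N):
    """Large-n estimate n^(N-1) / (N-1)!."""
    return n ** (N - 1) / math.factorial(N - 1)

if __name__ == "__main__":
    counts = (90, 100, 110)
    print(cc_size_exact(counts) / cc_size_stirling(counts))
    print(number_of_ccs_exact(300, 3), number_of_ccs_estimate(300, 3))
```

For n = 300 and three letters both ratios are already within about one percent of unity.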

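Let $Q(\vec{x})$ stand for the joint probability of the components of $\vec{x}$. Since we will
assume that these components are i.i.d.,

$$Q(\vec{x}) = \prod_{i=1}^{n} Q(x^i) . \tag{43}$$

For $A \subset S_{\underline{x}}^n$, let

$$Q(A) = \sum_{\vec{x} \in A} Q(\vec{x}) . \tag{44}$$

$Q(\vec{x})$ can be expressed in terms of relative entropy as follows:

$$Q(\vec{x}) = \prod\{Q(x)^{n(x|\vec{x})}\}_{\forall x} \tag{45a}$$

$$= \exp\Big[n \sum_x P_{C(\vec{x})}(x) \ln Q(x)\Big] \tag{45b}$$

$$= \exp\big[-nH(P_{C(\vec{x})}) - nD(P_{C(\vec{x})}//Q)\big] . \tag{45c}$$

Combining this expression for $Q(\vec{x})$ with the approximation Eq. (33) for $|C(\vec{x})|$ yields

$$Q(C(\vec{x})) = |C(\vec{x})|\, Q(\vec{x}) \tag{46a}$$

$$\approx \frac{\exp[-nD(P_{C(\vec{x})}//Q)]}{\sqrt{(2\pi n)^{N_{\underline{x}}-1} \prod\{P_{C(\vec{x})}\}}} . \tag{46b}$$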
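The identity expressing the probability of a word through the entropy and relative entropy of its CC holds exactly, not just asymptotically, and can be verified directly. The sketch below assumes a hypothetical three-letter source given as a dictionary `Q`; both function names are illustrative.

```python
import math
from collections import Counter

def word_probability_direct(word, Q):
    """Product of the letter probabilities of an i.i.d. word."""
    return math.prod(Q[x] for x in word)

def word_probability_from_cc(word, Q):
    """exp[-n H(P_C) - n D(P_C//Q)], where P_C is the CC distribution of the word."""
    n = len(word)
    P = {x: c / n for x, c in Counter(word).items()}
    H = -sum(p * math.log(p) for p in P.values())
    D = sum(p * math.log(p / Q[x]) for x, p in P.items())
    return math.exp(-n * (H + D))

if __name__ == "__main__":
    Q = {"a": 0.7, "b": 0.2, "c": 0.1}
    print(word_probability_direct("abacabaa", Q), word_probability_from_cc("abacabaa", Q))
```

Because the probability depends on the word only through its letter counts, every word of a given CC is equally probable, which is what makes the step $Q(C(\vec{x})) = |C(\vec{x})| Q(\vec{x})$ legitimate.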



Figure 2: Probability simplex $pd(S_{\underline{x}})$ for $N_{\underline{x}} = 3$. Two especially important points of
the simplex are its geometric center (the uniform distribution) and its center of mass (CM) $Q$. The graph on
the right illustrates how the entropy $H(P)$ decreases monotonically as point $P$ moves
from the geometric center to the edges.

The set $pd(S_{\underline{x}})$ is in 1-1 correspondence with a simplex in space $Reals^{N_{\underline{x}}}$. For
example, for $N_{\underline{x}} = 3$, this probability simplex is the region of $Reals^3$ that connects
the corners (1,0,0), (0,1,0) and (0,0,1). Fig. 2 shows $pd(S_{\underline{x}})$ for $N_{\underline{x}} = 3$. The
probability distributions $P_{C(\vec{x})}$ form a finite subset of this simplex. In Fig. 2, the
$P_{C(\vec{x})}$ are represented by dots inside $pd(S_{\underline{x}})$. Other notable points of $pd(S_{\underline{x}})$ are
its geometric center, the uniform distribution whose components all equal $\frac{1}{N_{\underline{x}}}$, and the CM distribution $Q(x)$. From Eq. (33) it
follows that the closer a CC is to the geometric center, the more elements the CC
has. If we represent CC's by points of $pd(S_{\underline{x}})$ with varying diameters, where fatter
points represent CC's with more elements, then the diameter of the points decreases
as we travel away from the geometric center. From Eq. (46), it follows that the closer a CC is to the
CM distribution $Q(x)$, the more probable the CC is. As in Fig. 2, if we show only
the most probable CC's, then most of the CC's shown cluster around the point $Q(x)$.
(This is why we call $Q(x)$ the CM distribution.)

As mentioned in the introduction, most RG methods comprise an iterative
step (i.e., a step that is performed repeatedly) consisting of a decimation followed
by a rescaling. CCRG is slightly different from this. In CCRG, we perform a preliminary
reduction that reduces a very large (i.e., infinite as $n \rightarrow \infty$) number of
degrees of freedom to a small, fixed (i.e., n independent) number. This is accomplished
by replacing sums like $\sum_{\vec{x}}$, that run over n discrete degrees of freedom, by
integrals like $\int \mathcal{D}P_{\underline{x}}$, that run over the far fewer $N_{\underline{x}}$ continuous degrees of freedom that
specify a point of $pd(S_{\underline{x}})$. After this preliminary reduction, we perform an iterative
step consisting of an infinitesimal rescaling of n followed by a rescaling of all other
parameters in such a way that the form of the partition function is not changed by
the iterative step.

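The following two claims embody the preliminary reduction step of CCRG.

Claim 3.2 (Reduction Formula 1) Suppose $f : pd(S_{\underline{x}}) \rightarrow Reals$. Define

$$I(f) = r(n, N_{\underline{x}}) \int_{pd(S_{\underline{x}})} \mathcal{D}P\; \frac{e^{nH(P)}}{\sqrt{\prod\{P\}}}\, f(P) , \tag{47}$$

where

$$r(n, N) = \left(\frac{n}{2\pi}\right)^{\frac{N-1}{2}} . \tag{48}$$

Then

$$\sum_{\vec{x}} f(P_{C(\vec{x})}) \approx I(f) , \tag{49}$$

and

$$I(1) \approx N_{\underline{x}}^n . \tag{50}$$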

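proof:

$$\sum_{\vec{x}} f(P_{C(\vec{x})}) = \sum_{C(\vec{x}) \in C(S_{\underline{x}}^n)} |C(\vec{x})|\, f(P_{C(\vec{x})}) \tag{51a}$$

$$= |C(S_{\underline{x}}^n)|\; \frac{\sum_{C(\vec{x}) \in C(S_{\underline{x}}^n)} |C(\vec{x})|\, f(P_{C(\vec{x})})}{\sum_{C(\vec{x}) \in C(S_{\underline{x}}^n)} 1} \tag{51b}$$

$$= |C(S_{\underline{x}}^n)|\; \frac{\sum_{P \in \mathcal{P}(S_{\underline{x}}^n)} |C(P)|\, f(P)}{\sum_{P \in \mathcal{P}(S_{\underline{x}}^n)} 1} \tag{51c}$$

$$\approx |C(S_{\underline{x}}^n)|\; \frac{\int_{pd(S_{\underline{x}})} \mathcal{D}P\; |C(P)|\, f(P)}{\int_{pd(S_{\underline{x}})} \mathcal{D}P\; 1} \tag{51d}$$

$$\approx I(f) . \tag{51e}$$

In Eq. (51), we went from line (d) to (e) by substituting previously derived values for
$|C(S_{\underline{x}}^n)|$, $|C(\vec{x})|$ and $\int_{pd(S_{\underline{x}})} \mathcal{D}P\, 1$. This proves Eq. (49).

If we substitute $f = 1$ into the LHS of Eq. (49), we get $N_{\underline{x}}^n$. But what if we
substitute $f = 1$ into the RHS of Eq. (49)? Does this also yield $N_{\underline{x}}^n$? Yes. Let's see how.
Define $\Delta P(x) = P(x) - \frac{1}{N_{\underline{x}}}$. If we expand $H(P)$ about the geometric center of the simplex, we get: (See
Appendix B for a compendium of Taylor expansions related to Information Theory)

$$H(P) \approx \ln N_{\underline{x}} - \frac{N_{\underline{x}}}{2} \sum_x [\Delta P(x)]^2 + O((\Delta P)^3) . \tag{52}$$

For large n, most of $I(1)$ comes from the vicinity of the geometric center. Since the geometric center is far away from the
boundary of the probability simplex, the constraint $\theta(P \geq 0)$ can be ignored in $I(1)$.
Thus, $I(1)$ can be approximated by:

$$I(1) \approx r(n, N_{\underline{x}})\, N_{\underline{x}}^{\,n + \frac{N_{\underline{x}}}{2}} \int \mathcal{D}\Delta P\; \delta\Big(\sum_x \Delta P(x)\Big) \exp\Big(-\frac{nN_{\underline{x}}}{2} \sum_x [\Delta P(x)]^2\Big) \tag{53a}$$

$$\approx N_{\underline{x}}^n . \tag{53b}$$

In Eq. (53), to go from line (a) to (b), we performed the integration using the Gaussian
integration formulae given in an appendix. QED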
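The first, exact step of this reduction (trading a sum over words for a sum over CC's weighted by their sizes) can be checked by brute force for a small word length. The sketch below assumes a three-letter alphabet coded as 0, 1, 2 and an arbitrary test function `f` of the CC distribution; the helper names are illustrative.

```python
import math
from itertools import product

def compositions(n, parts):
    """All tuples of `parts` non-negative integers that sum to n."""
    if parts == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, parts - 1):
            yield (first,) + rest

def multinomial(counts):
    """Number of words in the CC with the given letter counts."""
    size = math.factorial(sum(counts))
    for c in counts:
        size //= math.factorial(c)
    return size

def sum_over_words(f, n, N):
    """Sum of f(P_C(word)) over all N^n words."""
    return sum(f(tuple(w.count(a) / n for a in range(N)))
               for w in product(range(N), repeat=n))

def sum_over_classes(f, n, N):
    """Sum over CC's of |C| * f(P_C)."""
    return sum(multinomial(c) * f(tuple(k / n for k in c))
               for c in compositions(n, N))

if __name__ == "__main__":
    f = lambda P: sum(p * p for p in P)
    print(sum_over_words(f, 6, 3), sum_over_classes(f, 6, 3))
```

With `f` identically one, both sums reduce to the total number of words, $N^n$, which is the normalization the claim recovers from the integral side.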

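Claim 3.3 (Reduction Formula 2) Suppose $f : pd(S_{\underline{x},\underline{y}}) \rightarrow Reals$. Define

$$J(f) = r(n, N_{\underline{x},\underline{y}} - N_{\underline{y}}) \int \mathcal{D}P_{\underline{x},\underline{y}}\; \theta(P_{\underline{x},\underline{y}} \geq 0) \prod\{\delta(P_{\underline{y}}(y) - P_{C(\vec{y})}(y))\}_{\forall y}\; \frac{\exp[nH_{P_{\underline{x},\underline{y}}}(\underline{x}|\underline{y})]}{\sqrt{\prod\{P(x|y)\}_{\forall x,y}}}\, f(P_{\underline{x},\underline{y}}) . \tag{54}$$

Then

$$\sum_{\vec{x}} f\Big(P_{C\binom{\vec{x}}{\vec{y}}}\Big) \approx J(f) , \tag{55}$$

and

$$J(1) \approx N_{\underline{x}}^n . \tag{56}$$

proof: Clearly,

$$\sum_{\vec{x}} f\Big(P_{C\binom{\vec{x}}{\vec{y}}}\Big) = \sum_{\vec{x}_1, \vec{y}_1} \delta(\vec{y}_1, \vec{y})\, f\Big(P_{C\binom{\vec{x}_1}{\vec{y}_1}}\Big) . \tag{57}$$

We would like to transform the sum over the words $\vec{x}_1$ and $\vec{y}_1$ into a sum over "coarser"
items: namely, a sum over CC's like $C\binom{\vec{x}_1}{\vec{y}_1}$. These CC's are in 1-1 correspondence
with their probability distributions $P_{C\binom{\vec{x}_1}{\vec{y}_1}}$, and a sum over these distributions can
be approximated by an integral over the probability simplex $pd(S_{\underline{x},\underline{y}})$. All this can
be accomplished if we approximate the Kronecker delta for points $\vec{y}$ by a suitably
normalized Dirac delta function for distributions $P_{C(\vec{y})}$. So let us do the following
replacement:

$$\delta(\vec{y}_1, \vec{y}) \rightarrow K \prod\{\delta(P_{C(\vec{y}_1)}(y) - P_{C(\vec{y})}(y))\}_{\forall y} . \tag{58}$$

We choose the value of the normalization constant $K$ to be

$$K = \frac{\sqrt{\prod\{P_{C(\vec{y})}\}}\; \exp[-nH(P_{C(\vec{y})})]}{r(n, N_{\underline{y}})\; \delta\big(\sum_y P_{C(\vec{y})}(y) - 1\big)} . \tag{59}$$

(Division by a Dirac delta function is allowed as an intermediate step, before taking
the $\epsilon$ parameter of Eq. (19) to zero.) The reason for choosing this value for $K$ is as
follows. Using Reduction Formula 1 and Eq. (58), one gets

$$1 = \sum_{\vec{y}_1} \delta(\vec{y}_1, \vec{y}) \approx r(n, N_{\underline{y}}) \int \mathcal{D}P_{\underline{y}}\; \theta(P_{\underline{y}} \geq 0)\, \delta\Big(\sum_y P_{\underline{y}}(y) - 1\Big) \frac{\exp[nH(P_{\underline{y}})]}{\sqrt{\prod\{P_{\underline{y}}\}}}\; K \prod\{\delta(P_{C(\vec{y})}(y) - P_{\underline{y}}(y))\}_{\forall y} . \tag{60}$$

The previous equation is satisfied for the value of $K$ given by Eq. (59).

To show Eq. (55), one replaces the Kronecker delta $\delta(\vec{y}_1, \vec{y})$ in the RHS of
Eq. (57) by a coarser delta, in accordance with the prescription Eq. (58). Then one
applies Reduction Formula 1 to the result. This proves Eq. (55).

If we substitute $f = 1$ into the LHS of Eq. (55), we get $N_{\underline{x}}^n$. But what if we
substitute $f = 1$ into the RHS of Eq. (55)? Does this also yield $N_{\underline{x}}^n$? Yes. Here is
a sketch of the proof. The proof comprises two main steps: First, use the results
given in an appendix to convert $J(1)$ from an integral of the form $\int \prod\{dP(x,y)\}_{\forall x,y} (\cdot)$ to
a product over all $y$ of integrals of the form $\int \prod\{dP(x|y)\}_{\forall x} (\cdot)$. Second, apply the
Gaussian integration formulae given in an appendix. QED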



4 Noiseless Coding 

In this section we will discuss Noiseless Coding (i.e., a coding used in compression). 
In particular, we will calculate the probability of error, in the limit of large word size 
n, for compression using the Csiszar-Korner (CK) universal code. 



4.1 Error Model 

This section reviews the usual error model for compression using CK universal code. 
Subsequent sections will apply CCRG to it. 

[Diagram: the encoder E maps a word $\vec{x} \in S_{\underline{x}}^n$ to a message m; the decoder D maps m to a word $\vec{x}' \in S_{\underline{x}}^n$.]

Figure 3: Encoding and Decoding maps for Noiseless Coding.



A block source emits a stream of n-letter words $(x^1, x^2, \ldots, x^n) = x^{\cdot n} = \vec{x} \in S_{\underline{x}}^n$.
Each word is modelled as a sequence of n i.i.d. random variables $\underline{x}^i$ distributed
according to $Q(x^i)$, where $i \in Z_{1,n}$.

Suppose that, as shown in Fig. 3, (1) each word $\vec{x} \in S_{\underline{x}}^n$ is mapped by an
encoder function $E$ into a message $E(\vec{x}) = m$ labelled by an integer; (2) each message $m$ is in
turn mapped by a decoder function $D$ into a word $D(m) = \vec{x}' \in S_{\underline{x}}^n$. A block code is
characterized by: the probability distribution $Q$ of its source, its encoder function $E$
and its decoder function $D$. The block code is said to be universal if $E$ and $D$ do not
depend on $Q$.

Assume that $|{\rm Image}(E)| = |E(S_{\underline{x}}^n)| = M$. The compression factor or code rate
$R$ of the encoder is defined by

$$R = \frac{\ln M}{n} . \tag{61}$$

Note that if $N_{\underline{x}} = 2$, then $\frac{\log_2 M}{n} = \frac{n_{out}}{n_{in}}$, where $n_{out}$ (ditto, $n_{in}$) is the encoder output
(ditto, input) measured in bits. Note that $R \leq \ln N_{\underline{x}}$ because $M \leq N_{\underline{x}}^n$. For a fixed
rate block code, $R$ is fixed as $n \rightarrow \infty$.

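The probability of error for the code is given by

$$p_{err} = \sum_{\vec{x}} Q(\vec{x})\, \theta(D \circ E(\vec{x}) \neq \vec{x}) . \tag{62}$$

Assume a fixed rate block code and let

$$\tilde{R} = R - N_{\underline{x}} \frac{\ln(n+1)}{n} . \tag{63}$$

Of course, for large n, $\tilde{R} \approx R$. Let

$$A_{pass} = \{\vec{x} \in S_{\underline{x}}^n : H(P_{C(\vec{x})}) \leq \tilde{R}\} , \quad A_{stop} = S_{\underline{x}}^n - A_{pass} . \tag{64}$$

$|A_{pass}| \leq M$ because

$$|A_{pass}| = \sum_{C(\vec{x})} |C(\vec{x})|\, \theta(H(P_{C(\vec{x})}) \leq \tilde{R}) \tag{65a}$$

$$\leq \sum_{C(\vec{x})} e^{nH(P_{C(\vec{x})})}\, \theta(H(P_{C(\vec{x})}) \leq \tilde{R}) \tag{65b}$$

$$\leq e^{n\tilde{R}}\, |C(S_{\underline{x}}^n)| \tag{65c}$$

$$\leq e^{n\tilde{R}} (n+1)^{N_{\underline{x}}} = e^{nR} = M . \tag{65d}$$

If $|A_{pass}| \approx M$, then $|A_{stop}| \approx N_{\underline{x}}^n - M = e^{n \ln N_{\underline{x}}} - e^{nR}$. Since $R < \ln N_{\underline{x}}$, $|A_{stop}| \gg |A_{pass}|$
for large n.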

We can number the elements of $A_{pass}$ from 1 to $|A_{pass}|$. Call $m(\vec{x})$ the number
assigned to $\vec{x} \in A_{pass}$. The CK universal code is a fixed rate block code with encoding
and decoding functions defined by:

$$E(\vec{x}) = \begin{cases} m(\vec{x}) & \text{if } \vec{x} \in A_{pass} \\ 0 & \text{if } \vec{x} \notin A_{pass} \end{cases} , \tag{66}$$

$$D(m) = \begin{cases} E^{-1}(m) & \text{if } m \in Z_{1,|A_{pass}|} \\ \text{any } \vec{x} \in A_{pass} & \text{if } m = 0 \end{cases} . \tag{67}$$

Note that low entropy words (i.e., those $\vec{x}$ with $H(P_{C(\vec{x})}) \leq \tilde{R}$) belong to $A_{pass}$ and
are coded, whereas the high entropy words (i.e., those $\vec{x}$ with $H(P_{C(\vec{x})}) > \tilde{R}$) belong
to $A_{stop}$ and are not coded. Thus, the CK universal code can be described as a low
pass filter of word entropy. Why are low entropy words preferable to high entropy
ones for coding? Because for $R = H(Q)$, $Q(A_{pass})$ and $Q(A_{stop})$ are comparable even
though $|A_{pass}| \ll |A_{stop}|$. Note that

$$\theta(D \circ E(\vec{x}) \neq \vec{x}) = \theta(\vec{x} \notin A_{pass}) = \theta(H(P_{C(\vec{x})}) > \tilde{R}) . \tag{68}$$

Thus, for CK universal coding,

$$p_{err} = \sum_{\vec{x}} Q(\vec{x})\, \theta(H(P_{C(\vec{x})}) > \tilde{R}) . \tag{69}$$
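A tiny brute-force version of this code makes the definitions concrete. The sketch below is illustrative only and is not taken from WimpyRG: it assumes a binary alphabet, enumerates all $2^n$ words for a small n, uses the reduced rate $R - N \ln(n+1)/n$ as the entropy threshold, and lets the decoder return a fixed element of the pass set for the message 0.

```python
import math
from itertools import product

def word_entropy(word):
    """Entropy (nats) of the CC distribution of a binary word."""
    n, k = len(word), sum(word)
    return -sum((c / n) * math.log(c / n) for c in (k, n - k) if c > 0)

def build_ck_code(n, R):
    """Return (all words, pass set, encoder, decoder) for a binary alphabet."""
    reduced_rate = R - 2 * math.log(n + 1) / n        # N = 2 letters
    words = list(product((0, 1), repeat=n))
    a_pass = [w for w in words if word_entropy(w) <= reduced_rate]
    index = {w: m for m, w in enumerate(a_pass, start=1)}
    encode = lambda w: index.get(w, 0)                 # 0 flags "not coded"
    decode = lambda m: a_pass[m - 1] if m >= 1 else a_pass[0]
    return words, a_pass, encode, decode

if __name__ == "__main__":
    words, a_pass, encode, decode = build_ck_code(16, 0.65)
    print(len(a_pass), math.exp(16 * 0.65))
```

For n = 16 and R = 0.65 nats the pass set holds only the words with at most one letter differing from the majority, far fewer than the $e^{nR}$ messages allowed, which illustrates how loose the counting bound is at small n.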

Applying Reduction Formula 1 to the RHS of the previous equation yields

$$p_{err} \approx r(n, N_{\underline{x}}) \int_{pd(S_{\underline{x}})} \mathcal{D}P\; \frac{e^{-nD(P//Q)}}{\sqrt{\prod\{P\}}}\; \theta(H(P) > R) . \tag{70}$$

In the previous equation, the exponential inside the integral reaches its maximum
value when $D(P//Q) = 0$. If we approximate $P$ by $Q$ in the theta function of the
integrand, then we can pull the theta function out of the integral. Doing this yields

$$p_{err} \approx \theta(H(Q) > R) . \tag{71}$$

In other words, if the compression factor $R$ is larger (ditto, smaller) than $H(Q)$, then
the probability of error is zero (ditto, one). The next few sections of this paper will
be dedicated to improving this estimate of $p_{err}$.
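The step-function behaviour can be seen in an exact calculation for a binary source, where the sum over words collapses to a sum over the n + 1 composition classes. The sketch below assumes a source emitting the letter "one" with probability `q` and, as a simplification, applies the entropy threshold `R` directly (ignoring the small $\ln(n+1)/n$ correction); binomial weights are evaluated through `math.lgamma` to avoid overflow.

```python
import math

def binary_entropy(p):
    """Entropy (nats) of the distribution (p, 1 - p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def p_err_exact(n, q, R):
    """Total probability of the words whose CC entropy exceeds R."""
    total = 0.0
    for k in range(n + 1):                       # k = number of ones in the word
        if binary_entropy(k / n) > R:
            log_prob = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                        + k * math.log(q) + (n - k) * math.log(1.0 - q))
            total += math.exp(log_prob)
    return total

if __name__ == "__main__":
    HQ = binary_entropy(0.2)
    print(p_err_exact(2000, 0.2, HQ + 0.1), p_err_exact(2000, 0.2, HQ - 0.1))
```

At n = 2000 the error probability is already essentially 0 for a rate 0.1 nats above the source entropy and essentially 1 for a rate 0.1 nats below it.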



4.2 Old Approximation for p err 

In this section, we will review the standard calculation (see [3]) of the error exponent
for CK universal coding. In the next section, we will calculate the error exponent
(and much more) using CCRG.

The standard way of finding the error exponent for CK universal coding is
equivalent to using Laplace's Method to find the leading term of an asymptotic expansion
of Eq. (70). To apply Laplace's Method, we must minimize $D(P//Q)$ over all
$P \in pd(S_{\underline{x}})$, subject to the inequality constraint $H(P) \geq R$.

To obtain a minimum point $x^* \in Reals^n$ of a smooth, real-valued function
$f(x)$, subject to equality constraints $c_j(x) = 0$ for $j \in C_{eq}$, one can use the well known
method of Lagrange multipliers. But suppose that, in addition to these equality
constraints, $x^*$ must also satisfy inequality constraints $c_j(x) \geq 0$ for $j \in C_{geq}$. To
obtain a minimum $x^*$ in this more complicated case, one can generalize the method
of Lagrange multipliers. Kuhn and Tucker, among others, have done this. Let $J = C_{eq} \cup C_{geq}$,
and define the Lagrangian function $\mathcal{L} = f(x) - \sum_{j \in J} \lambda_j c_j(x)$. According
to Kuhn-Tucker, the minimum point $x^*$ and the Lagrange multipliers $(\lambda_j)_{\forall j \in J}$ must
satisfy the Kuhn-Tucker conditions, given by (1) $\nabla_x \mathcal{L} = 0$, (2) $\forall j \in C_{eq}$, one has
$c_j(x) = 0$, (3) $\forall j \in C_{geq}$, one has $c_j(x) \geq 0$, $\lambda_j \geq 0$ and $\lambda_j c_j(x) = 0$.

Figure 4: $A_{pass}$ when $H(Q)$ is greater or smaller than $R$. Strictly speaking, $A_{pass}$ is
a set of $\vec{x}$, and what we are showing is $\mathcal{P}(A_{pass})$ instead of $A_{pass}$.

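Let

$$\mathcal{L} = D(P//Q) - \lambda(H(P) - R) + \mu\Big(\sum_x P(x) - 1\Big) . \tag{72}$$

For the problem we are considering here, the Kuhn-Tucker conditions are (1) $\forall x$,
$\frac{\partial \mathcal{L}}{\partial P(x)} = 0$, (2) $\sum_x P(x) = 1$, (3) $H(P) - R \geq 0$, $\lambda \geq 0$, $\lambda(H(P) - R) = 0$. We will assume
that the inequality constraint is "active", in which case condition (3) reduces to
$H(P) = R$. Condition (1) implies:

$$0 = \ln\frac{P(x)}{Q(x)} + 1 - \lambda(-\ln P(x) - 1) + \mu \tag{73a}$$

$$= \ln(P^{1+\lambda}(x)) - \ln Q(x) + 1 + \lambda + \mu . \tag{73b}$$

The previous equation is satisfied by

$$P^{(\lambda)}(x) = \frac{Q(x)^{\frac{1}{1+\lambda}}}{Z} , \tag{74}$$

where

$$Z = \sum_x Q(x)^{\frac{1}{1+\lambda}} . \tag{75}$$

This value for $P^{(\lambda)}$ satisfies $\sum_x P^{(\lambda)}(x) = 1$ but does not yet satisfy $H(P^{(\lambda)}) = R$.
The equation $H(P^{(\lambda)}) = R$ defines a unique value of $\lambda$.
Define

$$\gamma(\lambda) = \min_P \mathcal{L} = D(P^{(\lambda)}//Q) . \tag{76}$$

Substituting the value for $P^{(\lambda)}$ given by Eq. (74) into $D(P^{(\lambda)}//Q)$ yields:

$$\gamma(\lambda) = \lambda R - (1 + \lambda) \ln Z . \tag{77}$$

$P^{(\lambda)}$ and $\gamma(\lambda)$ still depend on a parameter $\lambda$ which is specified implicitly by the
equation $H(P^{(\lambda)}) = R$. In fact, one can show that $H(P^{(\lambda)}) = R$ iff $\frac{d\gamma(\lambda)}{d\lambda} = 0$.
Define the error exponent $\gamma$ by

$$\gamma = \max_{\lambda \geq 0} \gamma(\lambda) . \tag{78}$$

It is now clear that $p_{err}$ given by Eq. (70) can be approximated by:

$$p_{err} \approx e^{-n\gamma} , \;\; \text{where} \;\; \gamma = \max_{\lambda \geq 0} [\lambda R - (1 + \lambda) \ln Z(\lambda)] . \tag{79}$$

Eq. (79) is the traditional [3] asymptotic approximation for the probability of error
for CK universal coding.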
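The error exponent can be computed in a few lines once the tilted distribution $P^{(\lambda)} \propto Q^{1/(1+\lambda)}$ is available. The sketch below assumes $H(Q) < R < \ln N$ and finds $\lambda$ by bisection on the condition $H(P^{(\lambda)}) = R$, relying on the fact that the entropy of the tilted distribution grows with $\lambda$. The source distribution, the bracketing value `lam_hi`, and the function names are illustrative assumptions.

```python
import math

def entropy(P):
    return -sum(p * math.log(p) for p in P if p > 0)

def relative_entropy(P, Q):
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

def tilted(Q, lam):
    """Return (P^(lambda), Z) with P^(lambda)(x) = Q(x)^(1/(1+lambda)) / Z."""
    weights = [q ** (1.0 / (1.0 + lam)) for q in Q]
    Z = sum(weights)
    return [w / Z for w in weights], Z

def gamma_of(Q, R, lam):
    """gamma(lambda) = lambda R - (1 + lambda) ln Z(lambda)."""
    return lam * R - (1.0 + lam) * math.log(tilted(Q, lam)[1])

def error_exponent(Q, R, lam_hi=1e3, iterations=200):
    """Solve H(P^(lambda)) = R by bisection; return (lambda, gamma, P^(lambda))."""
    lo, hi = 0.0, lam_hi
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if entropy(tilted(Q, mid)[0]) < R:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, gamma_of(Q, R, lam), tilted(Q, lam)[0]

if __name__ == "__main__":
    Q = [0.7, 0.2, 0.1]
    print(error_exponent(Q, 0.95))
```

At the solution the value of $\gamma(\lambda)$ coincides with $D(P^{(\lambda)}//Q)$ and is the largest value of $\gamma(\lambda)$ over $\lambda \geq 0$, in line with the stationarity condition stated above.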

4.3 New (CCRG) Approximation for p err 

In this section and the next one, we will use CCRG to calculate the probability of
error for compression using the CK universal code. In this section, we will calculate
$p_{err}$ as given by Eq. (70), assuming that we have rescaled the variables of the RHS of
Eq. (70) so that the integrand is a Gaussian. In the next section, we will derive the
RG equations that characterize this rescaling.

Let $\Delta P = P - Q$, $\Delta H(P) = H(P) - H(Q)$, and $\Delta R = R - H(Q)$. Hence,
$\theta(H(P) > R) = \theta(\Delta H(P) > \Delta R)$.

Let

$$\mathcal{L} = D(P//Q) - \lambda(\Delta H(P) - \Delta R) + \mu\Big(\sum_x P(x) - 1\Big) . \tag{80}$$

Minimizing this Lagrangian with respect to $P, \lambda, \mu$ gives the saddle (or boundary)
point $P^*$ that dominates the integral given by Eq. (70). Unfortunately, finding an
explicit expression for $P^*$ is not possible.
Define test fractions $\Phi_0$ and $\Phi_1$ by

$$\Phi_0 = \frac{D(P^*//Q)}{\sum_x \frac{[\Delta P^*(x)]^2}{2Q(x)}} - 1 , \tag{81}$$

$$\Phi_1 = \frac{H(P^*) - H(Q)}{-\sum_x \Delta P^*(x) \ln Q(x)} - 1 . \tag{82}$$

$\Phi_0$ (ditto, $\Phi_1$) measures how much $D(P^*//Q)$ (ditto, $\Delta H(P^*)$) differs from the leading
term of its Taylor expansion about $Q$. (See Appendix B for a compendium of Taylor
expansions related to Information Theory).

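Suppose we have rescaled the variables in the RHS of Eq. (70) so that after
rescaling, we are in the "Gaussian region": $\Phi_0 \ll 1$ and $\Phi_1 \ll 1$. Then Eq. (70) can
be approximated by

$$p_{err} \approx \frac{r(n, N_{\underline{x}})}{\sqrt{\prod\{P^*\}}} \int \mathcal{D}\Delta P\; \delta\Big(\sum_x \Delta P(x)\Big) \exp\Big[-n \sum_x \frac{[\Delta P(x)]^2}{2Q(x)}\Big]\; \theta\Big(-\sum_x \Delta P(x) \ln Q(x) > \Delta R\Big) . \tag{83}$$

(For large n, if $Q$ is not too close to the boundary of the probability simplex, then
the constraint $\theta(P \geq 0)$ can be ignored.)

In the Gaussian region, we can also approximate Eq. (80) by

$$\mathcal{L} = \sum_x \frac{[\Delta P(x)]^2}{2Q(x)} - \lambda\Big(-\sum_x \Delta P(x) \ln Q(x) - \Delta R\Big) + \mu\Big(\sum_x P(x) - 1\Big) . \tag{84}$$

Minimizing this Lagrangian with respect to $P, \lambda, \mu$ gives the point $P^*$ that dominates
the integral given by Eq. (83). Finding an explicit expression for $P^*$ in the Gaussian
region is possible. $\frac{\partial \mathcal{L}}{\partial \Delta P(x)} = 0$ gives:

$$\frac{\Delta P(x)}{Q(x)} + \lambda \ln Q(x) + \mu = 0 . \tag{85}$$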
Enforcing the constraints $-\sum_x \Delta P(x) \ln Q(x) = \Delta R$ and $\sum_x P(x) = 1$ then yields

$$\Delta P^*(x) = B(x) \Delta R , \tag{86}$$

where

$$B(x) = \frac{Q(x)\beta(x)}{\langle \beta^2 \rangle} , \tag{87}$$

$$\beta(x) = -[\ln Q(x) + H(Q)] , \tag{88}$$

$$\langle \beta \rangle = \sum_x Q(x)\beta(x) = 0 , \tag{89}$$

$$\langle \beta^2 \rangle = \sum_x Q(x)\beta^2(x) . \tag{90}$$

On the RHS of Eq. (83), we can apply the Gaussian Integration Formulae given in an
appendix. We can also substitute there the value for $P^*$ given by Eq. (86). Doing
so finally gives

$$p_{err} \approx \frac{1}{2}\, {\rm erfc}(u) , \tag{91}$$

where

$$u = \Delta R \sqrt{\frac{n}{2\langle \beta^2 \rangle}} . \tag{92}$$

An appendix reviews some basic properties of the Error Function erf() and its
complement erfc().

4.4 RG Equations 

In this section, we will calculate the RG equations for compression using CK universal 
coding. 

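Important: In this section, $\Delta P^{(s)}$ describes the motion, upon successive
rescalings, of the point that dominates the integral of Eq. (70).

Consider the argument of the exponential in the integrand of Eq. (70). It
should be invariant under a change of scale:

$$n^{\wedge} D^{\wedge}(P//Q) = n D(P//Q) . \qquad (93)$$

If for some $\delta s$ such that $0 < \delta s \ll 1$,

$$n^{\wedge} = e^{\delta s} n , \qquad (94)$$

then

$$D^{\wedge}(P//Q) = D(P^{\wedge}//Q) = e^{-\delta s} D(P//Q) . \qquad (95)$$

Define

$$P^{\wedge}(x) = P^{(\delta s)}(x) = (1 - \gamma_0 \delta s) P(x) + (\gamma_0 \delta s) Q(x) . \qquad (96)$$

Then, for $s > 0$,

$$\frac{\partial \Delta P^{(s)}(x)}{\partial s} = -\gamma_0(P^{(s)}, Q)\, \Delta P^{(s)}(x) , \qquad (97)$$

where

$$\gamma_0(P, Q) = \lim_{s \to 0} \frac{(-1)}{\Delta P^{(s)}(x)} \frac{\partial \Delta P^{(s)}(x)}{\partial s} . \qquad (98)$$

By virtue of Eq. (95),

$$\frac{\partial D(P^{(s)}//Q)}{\partial s} = -D(P^{(s)}//Q) , \qquad (99)$$

so that

$$1 = \lim_{s \to 0} \frac{(-1)}{D(P^{(s)}//Q)} \frac{\partial D(P^{(s)}//Q)}{\partial s} . \qquad (100)$$

Note that

$$\lim_{s \to 0} \frac{\partial D(P^{(s)}//Q)}{\partial s} = \lim_{s \to 0} \sum_x \frac{\partial P^{(s)}(x)}{\partial s} \Big( \ln \frac{P^{(s)}(x)}{Q(x)} + 1 \Big) \qquad (101a)$$

$$= -\gamma_0 \sum_x \Delta P(x) \Big( \ln \frac{P(x)}{Q(x)} + 1 \Big) \qquad (101b)$$

$$= -\gamma_0 \, [D(P//Q) + D(Q//P)] . \qquad (101c)$$

Thus

$$\gamma_0(P, Q) = \frac{D(P//Q)}{D(P//Q) + D(Q//P)} . \qquad (102)$$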

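Now consider the theta function in the integrand of Eq. (70). It too should be
invariant under a change of scale:

$$\theta(\Delta H^{\wedge}(P) > \Delta R^{\wedge}) = \theta(\Delta H(P) > \Delta R) . \qquad (103)$$

If for some $\delta s$ such that $0 < \delta s \ll 1$,

$$\Delta R^{\wedge} = e^{-\gamma_1 \delta s} \Delta R , \qquad (104)$$

then

$$\Delta H^{\wedge}(P) = \Delta H(P^{\wedge}) = e^{-\gamma_1 \delta s} \Delta H(P) . \qquad (105)$$

Eqs. (104) and (105) imply

$$\frac{\partial \Delta R^{(s)}}{\partial s} = -\gamma_1(P^{(s)}, Q)\, \Delta R^{(s)} , \qquad (106)$$

where

$$\gamma_1(P, Q) = \lim_{s \to 0} \frac{(-1)}{\Delta R^{(s)}} \frac{\partial \Delta R^{(s)}}{\partial s} \qquad (107)$$

and

$$\gamma_1(P, Q) = \lim_{s \to 0} \frac{(-1)}{\Delta H(P^{(s)})} \frac{\partial \Delta H(P^{(s)})}{\partial s} . \qquad (108)$$

Note that

$$\Delta H(P) = -\sum_x \Delta P(x) \ln P(x) + D(Q//P) . \qquad (109)$$

Interchanging $P$ and $Q$ in the previous equation also yields:

$$-\Delta H(P) = \sum_x \Delta P(x) \ln Q(x) + D(P//Q) . \qquad (110)$$

Note that

$$\lim_{s \to 0} \frac{\partial \Delta H(P^{(s)})}{\partial s} = (-1) \lim_{s \to 0} \sum_x \frac{\partial P^{(s)}(x)}{\partial s} \big( \ln P^{(s)}(x) + 1 \big) \qquad (111a)$$

$$= \gamma_0 \, [-\Delta H(P) + D(Q//P)] . \qquad (111b)$$

Thus,

$$\gamma_1(P, Q) = \gamma_0(P, Q) \left[ 1 - \frac{D(Q//P)}{\Delta H(P)} \right] . \qquad (112)$$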
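
The two critical exponents can be evaluated directly from a pair of distributions. The Python sketch below assumes $\gamma_0 = D(P//Q)/[D(P//Q)+D(Q//P)]$ and $\gamma_1 = \gamma_0[1 - D(Q//P)/\Delta H(P)]$, the latter being what one obtains by combining the definition of $\gamma_1$ with Eq. (111b). The perturbation `direction` and the helper names are illustrative assumptions, not part of WimpyRG.

```python
import math

def entropy(P):
    return -sum(p * math.log(p) for p in P)

def rel_entropy(P, Q):
    """Relative entropy D(P//Q) in nats."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q))

def gamma0(P, Q):
    d_pq, d_qp = rel_entropy(P, Q), rel_entropy(Q, P)
    return d_pq / (d_pq + d_qp)

def gamma1(P, Q):
    delta_H = entropy(P) - entropy(Q)
    return gamma0(P, Q) * (1.0 - rel_entropy(Q, P) / delta_H)

Q = [0.20, 0.30, 0.13, 0.37]
direction = [0.02, -0.01, 0.03, -0.04]    # hypothetical perturbation, sums to zero

exponents = {}
for eps in (1.0, 0.1, 0.01):
    P = [q + eps * d for q, d in zip(Q, direction)]
    exponents[eps] = (gamma0(P, Q), gamma1(P, Q))
    print(eps, exponents[eps])
```

As `eps` shrinks, both printed exponents approach 1/2, which is the limiting value expected as $P \to Q$.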



We will call $\gamma_0$ and $\gamma_1$ the critical exponents for $\Delta P^{(s)}$ and $\Delta R^{(s)}$, respectively.
Note that $\gamma_0(P, Q)$ and $\gamma_1(P, Q)$ both tend to $\frac{1}{2}$ as $P \to Q$. Note also that $\gamma_0$
and $\gamma_1$ are related to the test fraction $\Phi_1$ as follows. Define $\phi_1$ by



(113) 



^-1 






7o 




AH(P) 
Then



$1 



AH(P)+Y, X AP(x) \nQ{x) 



E.AP(x) \nQ(x 
D(P//Q) 



AH(P) + D(P//Q) 



(114a) 
(114b) 



1 + 



(114c) 



In conclusion, we must solve the following pair of coupled RG equations

$$\frac{\partial \Delta P^{(s)}(x)}{\partial s} = -\gamma_0(Q + \Delta P^{(s)}, Q)\, \Delta P^{(s)}(x) \qquad (115)$$

for all $x \in S_{\underline{x}}$, and

$$\frac{\partial \Delta R^{(s)}}{\partial s} = -\gamma_1(Q + \Delta P^{(s)}, Q)\, \Delta R^{(s)} . \qquad (116)$$

We must solve this pair of RG equations subject to the following pair of boundary
conditions: At $s = 0$:

$$\Delta R^{(0)} = \Delta R , \qquad (117)$$

and at $s = s_{fin}$:

$$\Delta P^{(s_{fin})}(x) = B(x)\, \Delta R^{(s_{fin})} . \qquad (118)$$

$s_{fin}$ is defined as any $s$ large enough for the following to be true: $\Phi_0(P^{(s_{fin})}, Q) \ll 1$
and $\Phi_1(P^{(s_{fin})}, Q) \ll 1$.

Section 6 describes a computer program called WimpyRG-C1.0 that solves
these RG equations.



5 Noisy Coding 

In this section, we will discuss Noisy Coding (i.e., a coding used in channel transmis- 
sion). In particular, we will calculate the probability of error, in the limit of large 
word size n, for channel transmission using random encoding and maximum-likelihood 
decoding. 

5.1 Error Model 

In this section we will review the error model for channel transmission using random 
encoding and maximum-likelihood decoding. Subsequent sections will apply CCRG
to it.

Figure 5: Encoding, Channel and Decoding maps for Noisy Coding.



Suppose that, as shown in Fig. 5: (1) Each message $m \in Z_{1,M}$ is mapped by an
encoder function $E$ into a word $x^n \in S_{\underline{x}}^n$. (2) A channel $Q(y^n|x^n)$ gives the probability
that word $x^n \in S_{\underline{x}}^n$ is mapped into word $y^n \in S_{\underline{y}}^n$. (3) Each word $y^n$ is then mapped
by a decoder function $D$ into message $m' \in Z_{0,M}$. We assume a discrete memoryless
channel, by which we mean that

$$Q(y^n|x^n) = \prod_{i \in Z_{1,n}} Q(y_i|x_i) . \qquad (119)$$

An $(M, n)$ channel code is characterized by its encoding function $E$, the conditional
probability of its channel $Q(y|x)$, and its decoding function $D$.

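Let $p_{err|m}$ be the probability of error when message $m \in Z_{1,M}$ exits the encoder.
Then

$$p_{err|m} = \Pr\{D(\underline{y}^n) \neq m \,|\, \underline{x}^n = E(m)\} . \qquad (120)$$

The code rate $R$ of the encoder is defined by

$$R = \frac{\ln M}{n} . \qquad (121)$$

Note that if $N_{\underline{x}} = 2$, then $\frac{\log_2 M}{n} = \frac{n_{in}}{n_{out}}$ (careful: for noiseless coding $\frac{\log_2 M}{n} = \frac{n_{out}}{n_{in}}$
instead), where $n_{out}$ (ditto, $n_{in}$) is the encoder output (ditto, input) measured in bits.
The maximum achievable rate $R_{max.ach.}$ is defined by:

$$R_{max.ach.} = \lim_{\epsilon \to 0} \lim_{n \to \infty} \sup\Big\{ \frac{\ln M}{n} : \exists (n, E, D)\ \forall m,\ p_{err|m}(n, E, D) < \epsilon \Big\} . \qquad (122)$$

The information capacity $C$ is defined by:

$$C = \max_{Q_{\underline{x}}} H(\underline{x} : \underline{y}) . \qquad (123)$$

The fact that $R_{max.ach.} = C$ is essentially Shannon's Noisy Coding (or "Second")
Theorem.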

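Eq. (120) can be re-expressed as

$$p_{err|m} = \sum_{y^n} \Pr\{D(\underline{y}^n) \neq m \,|\, \underline{x}^n = E(m), \underline{y}^n = y^n\}\, \Pr\{\underline{y}^n = y^n \,|\, \underline{x}^n = E(m)\} \qquad (124a)$$

$$= \sum_{y^n} \theta(D(y^n) \neq m)\, Q(y^n|x^n(m)) \qquad (124b)$$

$$= 1 - \sum_{y^n} \theta(D(y^n) = m)\, Q(y^n|x^n(m)) . \qquad (124c)$$

A random encoder $E$ is defined by choosing each component of $x^n = E(m)$
independently from the other components and according to the probability distribution
$Q(x)$. With such an encoder,

$$p_{err} = \sum_{m \in Z_{1,M},\, E} p_{err|m,E}\, P(E)\, P(m) \qquad (125a)$$

$$= \sum_{m \in Z_{1,M}} P(m) \sum_{\substack{x^n(m_1) \in S_{\underline{x}}^n \\ \forall m_1 \in Z_{1,M}}} \Big\{ \prod_{m_1} Q(x^n(m_1)) \Big\} \Big[ 1 - \sum_{y^n} \theta(D(y^n) = m)\, Q(y^n|x^n(m)) \Big] . \qquad (125b)$$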



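Suppose $\Gamma : S_{\underline{y}}^n \times Z_{1,M} \to \{true, false\}$ is a condition, and $Good(\Gamma)$ is the set
of all $y^n$ for which there is a unique $m \in Z_{1,M}$ that satisfies $\Gamma(y^n, m) = true$. Also let
$Bad(\Gamma) = S_{\underline{y}}^n - Good(\Gamma)$. One can define the decoding function $D$ implicitly in terms
of the condition $\Gamma$ as follows:

$$D(y^n) = \begin{cases} \text{unique } m \text{ such that } \Gamma(y^n, m) = true, & \text{if } y^n \in Good(\Gamma) \\ 0, & \text{if } y^n \in Bad(\Gamma) \end{cases} . \qquad (126)$$

Hence, for $m \in Z_{1,M}$ and $y^n \in Good(\Gamma)$,

$$\theta(D(y^n) = m) = \theta(\Gamma(y^n, m)) . \qquad (127)$$

The maximum likelihood (ML) decoder is defined by the condition

$$\Gamma(y^n, m) = \left( \frac{Q(y^n|x^n(m))}{Q(y^n|x^n(m'))} > 1 \quad \forall m' \in Z_{1,M},\ m' \neq m \right) . \qquad (128)$$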
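
The decoding rule of Eqs. (126) and (128) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the channel is stored as a table `channel[a][b] = Q(y=b | x=a)`, messages are numbered from 1, and the return value 0 is used for outputs in $Bad(\Gamma)$ (ties). The toy repetition codebook is hypothetical.

```python
import math

def log_likelihood(y, x, channel):
    """ln Q(y^n | x^n) for a discrete memoryless channel, as in Eq. (119)."""
    return sum(math.log(channel[a][b]) for a, b in zip(x, y))

def ml_decode(y, codebook, channel, tol=1e-12):
    """Return the unique message (1-based) maximizing Q(y|x(m)), or 0 on a tie."""
    scores = [log_likelihood(y, x, channel) for x in codebook]
    best = max(scores)
    winners = [m for m, s in enumerate(scores, start=1) if best - s <= tol]
    return winners[0] if len(winners) == 1 else 0

bsc = [[0.99, 0.01], [0.01, 0.99]]                  # binary symmetric channel
codebook = [[0, 0, 0, 0, 0], [1, 1, 1, 1, 1]]       # hypothetical M = 2, n = 5 code
decoded = ml_decode([0, 1, 0, 0, 0], codebook, bsc)
print(decoded)
```

Working with log-likelihoods instead of the ratio in Eq. (128) avoids underflow for long words while selecting the same message.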






(As illustrated in Fig. 6, we assume that $Bad(\Gamma)$ is negligibly small, in the sense that,
for all $m \in Z_{1,M}$, $\sum_{y^n \in Bad(\Gamma)} Q(y^n|x^n(m)) \ll 1$.) Actually, the ML decoder is not
optimal. It can be shown [3] that the optimal decoder is one for which

r(2/ ' m) = [ QWm'M 1,M ' m ^ m ) ■ (129) 




Figure 6: Intuitive picture of condition Eq. (128) for Maximum Likelihood decoder.



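For each $(x^n, y^n) \in S_{\underline{x}}^n \times S_{\underline{y}}^n$, define functions $v$ and $f$ by

$$v(x^n, y^n) = \sum_{x'^n} \theta\left( \frac{Q(y^n|x^n)}{Q(y^n|x'^n)} > 1 \right) Q(x'^n) , \qquad (130)$$

and

$$f = 1 - v . \qquad (131)$$

(mnemonic: $v$ stands for victory and $f$ for failure).

If we substitute into Eq. (125b) the value of $\theta(D(y^n) = m)$ for ML decoding,
one finds for random encoding and ML decoding:

$$p_{err} = 1 - \sum_{x^n, y^n} Q(x^n)\, Q(y^n|x^n)\, [v(x^n, y^n)]^{M-1} . \qquad (132)$$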

x,y 

Later on, we will show that $f \sim e^{-nC}$. Since $M = e^{nR}$, it follows that for random
encoding and ML decoding

$$p_{err} \approx 1 - (1 - e^{-nC})^M \qquad (133a)$$

$$\approx 1 - \exp(-M e^{-nC}) \qquad (133b)$$

$$\approx 1 - \exp[-e^{n(R - C)}] \qquad (133c)$$

$$\approx 1 - \theta(R - C < 0) = \theta(R > C) . \qquad (133d)$$

In Eq. (133), we went from line (c) to (d) by using the following easy to prove identity:
For all $x \neq 0$,

$$\theta(x > 0) = \lim_{n \to \infty} \exp[-\exp(-nx)] . \qquad (134)$$

According to Eq. (133d), if the code rate $R$ is larger (ditto, smaller) than the channel
capacity $C$, then the probability of error is one (ditto, zero). The next few sections
of this paper will be dedicated to improving this estimate of $p_{err}$.
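
To see how quickly the double exponential of Eq. (133c) sharpens into the step function of Eq. (133d), one can tabulate it for growing $n$. The Python sketch below is illustrative only; the function name `p_err_threshold` and the rate offsets of 0.05 nats are assumptions made for the example.

```python
import math

def p_err_threshold(R, C, n):
    """Evaluate 1 - exp(-exp(n (R - C))), i.e. Eq. (133c)."""
    exponent = n * (R - C)
    if exponent > 700.0:          # exp() would overflow; the limit is 1
        return 1.0
    return -math.expm1(-math.exp(exponent))

C = 0.637                         # capacity in nats (value quoted for Fig. 8)
table = {}
for n in (20, 200, 2000):
    table[n] = (p_err_threshold(C - 0.05, C, n), p_err_threshold(C + 0.05, C, n))
    print(n, table[n])
```

For $n = 20$ the two values are still far from 0 and 1, which is one way of seeing why a finer estimate of $p_{err}$ is needed at moderate codeword sizes.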



5.2 New (CCRG) Approximation for p err 

In this section and the next one, we will use CCRG to calculate the probability of
error for channel transmission using random encoding and ML decoding. This section
will calculate $p_{err}$ as given by Eq. (132), assuming that we have rescaled the variables
on the RHS of Eq. (132) so that the integrand is Gaussian. The next section will
calculate the RG equations that characterize this rescaling.

In what follows, we will use $Q(x,y)$ to mean $Q(x,y) = Q(y|x)Q(x)$, where
$Q(y|x)$ (ditto, $Q(x)$) is the probability distribution that specifies the transmission
channel (ditto, the random encoding). We will also use the following abbreviations:

$$C_1 = \sum_{x,y} Q(x,y) \ln \frac{Q(x,y)}{Q(x)Q(y)} = H_Q(\underline{x} : \underline{y}) , \qquad (135)$$

$$\Delta R = R - C_1 , \qquad (136)$$

$$\Delta P(x, y) = P(x, y) - Q(x, y), \quad \Delta P(x|y) = P(x|y) - Q(x|y) , \qquad (137)$$

$$L_{xy} = \ln \frac{Q(x,y)}{Q(x)Q(y)} . \qquad (138)$$

Note that $C_1$ is not equal to the channel capacity $C$, but $C = \max_{Q_{\underline{x}} \in pd(S_{\underline{x}})} C_1$.
Applying Reduction Formula 1 to Eq. (132) yields



Perr = 1"^^^ (139a) 



l-r(n,N^)l VP„ "'^yW v« . (139b) 



- Ms„) - MP,,,} 



For $n \gg 1$, and fixed $R$, $M = e^{nR} \gg 1$. Later on we will show that $0 < f \ll 1$.
The inequalities $M \gg 1$, and $0 < f \ll 1$, and Eq. (134) imply

$$v^M = (1 - f)^M \approx e^{-Mf} = e^{-\exp(nR + \ln f)} \approx \theta\Big(R + \frac{\ln f}{n} < 0\Big) . \qquad (140)$$

Our next goal is to calculate $\ln(f)$. One has



my) = £ my'W f^|4 < i) Q@) 
j j \Q{y\x) ) 



(141) 



x ,y 



Henceforth, we will abbreviate the probability distributions for the CC's C 
x' 



x 



and C I - I as follows: 

y' / 

P 
c\ 



X 



y 



p p 

1 x,y j 1 



c 



x' 



Using these abbreviations, one has 



(142) 



( Q(y\> 



\Q(y'\x>) 



\ „ /exp[n J2t 1, P(x, y) \nQ(y\x)] \ 
< 1 = 9 [ P[ ^ a ' y ~ ,yj — ^ Kyi n < 1 (143a) 
J \exp[nE x ,y P(x,y) In Q(y\x)} J 

= e(jJP-P]{x,y)\nQ{y\x) < o) . (143b) 
Substituting Eq. (|143bj) into Eq. (jl41j) and applying Reduction Formula 2 yields 



f(x,y) 



r(n, N &y - N, 



■ w -^ J vP x _ >y _ e(P x _, y _ > o) n {5(P(y) - P{y)\ y 

e(^[P-P}(x,y)L xy <0 

I\{P(x\y)} 



[U{Py}} 

exp[nHp(x\y) + nJ2 x , y P(x,y) lnQ(x)] 



• (144) 



\/x,y 



Note that 



Hence, 



D{P x J/Q x _P y ) = Y J P{x,y) 



x.y 



, P{x,y) . Q{x,y) Qfe)L, , 

In — r + In — — — . + In — 145a 

Q(x,y) Q(x)Q(y) P( y )] } 

(145b) 



D(P„J/Q &y ) + d + Y,&P(x,y)L xy - DiPyf/Q, 



x.y 



f{x,y) 



r{n, N X;y - Ny) exp[-nCi + nD{P y J /Q y )\ 
I VP &y _ B{P &y _ > 0) J] {5(P(y) - P{y /}
exp[-nD(P^ y //Q^y) - nJ2 x , y AP(x,y)L xy ] 



e(Y^[P-P}(^y)L xy <o] . (146) 



We will assume that, in the integrand of the previous equation, the inequality con- 
straint is active; i.e., that J2x,y AP(x, y)L xy = J2 x , y AP(x, y)L xy . Therefore, we can 

simplify Eq. (jl46|) by pulling e~ n ^- /x -y AP( ~ x ^ Lxy outside the integral to get 

r(n,N & y - N y )exp[-nCi+nD(P y //Qy) - nY,x, y AP(x,y)L xy ] 
= = = ^TTTT^Zlf 



[U{PyJ] 

J vP^ y _ 9{P^_ > o) n {5(P(y) - P(y 

exp [-nD{P^ y / 1 Qx, y )\ 



9[J2[P-P]^,y)L xy <0] . (147) 



To find ln(/) to leading order in n, we need to find the point P*(x, y) that dominates 
the integral on the RHS of Eq. ()147|) . To find P*, we must minimize the following 
Lagrangian with respect to P, A, and \i y : 

C = D(P x J/Q^ y ) - X (j2(P ~ P)(x, y)L xy ) + £ - P)(y) . (148) 

\x,y ) y 

The Gaussian approximation for the previous Lagrangian is: 



£ = E [ % P n t V I ~ A fe( p " P X X > y) L «y) + £M P ~ P)(y) • (149) 

x,y £\°£\-l", y) \x,y ) y 

Assume that the exact Lagrangian of Eq. (148) is well approximated by its Gaussian
approximation. (This assumption is not necessary and will be removed later, in
Appendix E.) Let

$$a_{xy} = L_{xy} - \sum_{x'} Q(x'|y) L_{x'y} , \qquad (150)$$

$$\langle a \rangle = \sum_{x,y} Q(x, y)\, a_{xy} = 0 , \qquad (151)$$

$$\langle a^2 \rangle = \sum_{x,y} Q(x, y)\, a_{xy}^2 \qquad (152a)$$

$$= \sum_{x,y} Q(x, y)\, L_{xy}\, a_{xy} . \qquad (152b)$$

Minimizing Eq. (jl49j) with respect to P, A, and fi y yields 

-J2 x>y P(y)AP(x\y)L xy 



X = ^ v 7 2 , v * , (153) 
{a 2 } 



and 



AP*(x, y) = -\Q(x, y)a xy - Q(x\y)AP(y) . (154) 
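If $\mathcal{L}^*$ is the value of $\mathcal{L}$ at the extremum, then

$$\mathcal{L}^* = D(P_{\underline{y}}//Q_{\underline{y}}) + t , \qquad (155)$$

where, to lowest order in $\Delta P$, $t$ is given by

$$t = \frac{\epsilon^2}{2 \langle a^2 \rangle} , \qquad (156)$$

where

$$\epsilon = \sum_{x,y} P(y)\, \Delta P(x|y)\, L_{xy} . \qquad (157)$$

Now that we know $\mathcal{L}^*$, we can apply Laplace's Method to the integral on the
RHS of Eq. (147) to get

$$\ln(f) \approx -nC_1 + nD(P_{\underline{y}}//Q_{\underline{y}}) - n \sum_{x,y} \Delta P(x,y) L_{xy} - n\mathcal{L}^* \qquad (158a)$$

$$\approx -n \Big[ C_1 + \sum_{x,y} \Delta P(x,y) L_{xy} + t \Big] . \qquad (158b)$$

This value for $\ln(f)$ can be inserted into Eqs. (139) and (140) to get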



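$$p_{err} \approx 1 - \Gamma(n, N_{\underline{x},\underline{y}}) \int \mathcal{D}P_{\underline{x},\underline{y}}\; \frac{\exp[-nD(P_{\underline{x},\underline{y}}//Q_{\underline{x},\underline{y}})]}{\sqrt{\prod_{x,y} \{P(x,y)\}}}\; \theta\Big( \Delta R - \sum_{x,y} \Delta P(x,y) L_{xy} - t < 0 \Big) . \qquad (159)$$

Assume that the integral of the previous equation has been rescaled so that
its integrand is in the Gaussian regime. Then

$$p_{err} \approx 1 - \frac{\Gamma(n, N_{\underline{x},\underline{y}})}{\sqrt{\prod_{x,y} \{P^*(x,y)\}}} \int \mathcal{D}P_{\underline{x},\underline{y}}\; \exp\Big[ -n \sum_{x,y} \frac{[\Delta P(x,y)]^2}{2Q(x,y)} \Big]\; \theta\Big( \sum_{x,y} \Delta P(x,y) L_{xy} > \Delta R \Big) . \qquad (160)$$

Let

$$\beta_{xy} = L_{xy} - C_1 , \qquad (161)$$

$$\langle \beta \rangle = \sum_{x,y} Q(x,y)\, \beta_{xy} = 0 , \qquad (162)$$

$$\langle \beta^2 \rangle = \sum_{x,y} Q(x,y)\, \beta_{xy}^2 . \qquad (163)$$

Applying the Gaussian Integration Formulae of Appendix C to the RHS of Eq. (160)
yields

$$p_{err} \approx 1 - \frac{1}{2\nu}\, \mathrm{erfc}\left( \Delta R \sqrt{\frac{n}{2 \langle \beta^2 \rangle}} \right) , \qquad (164)$$

where

$$\nu = \sqrt{\frac{\prod_{x,y} \{P^*(x,y)\}}{\prod_{x,y} \{Q(x,y)\}}} . \qquad (165)$$

To find the dominant point $P^*_{\underline{x},\underline{y}}$ alluded to in Eq. (165), one must minimize the
following Lagrangian with respect to $P$, $\lambda$ and $\mu$:

$$\mathcal{L} = \sum_{x,y} \frac{[\Delta P(x,y)]^2}{2Q(x,y)} + \lambda\Big( \Delta R - \sum_{x,y} \Delta P(x,y) L_{xy} \Big) + \mu \sum_{x,y} \Delta P(x,y) . \qquad (166)$$

One finds that the extremum is at

$$\Delta P^*(x,y) = B(x,y)\, \Delta R , \qquad (167)$$

where

$$B(x,y) = \frac{\beta_{xy}\, Q(x,y)}{\langle \beta^2 \rangle} . \qquad (168)$$

Substituting this value for $\Delta P^*$ into Eq. (165) gives

$$\nu = \sqrt{ \prod_{x,y} \Big\{ 1 + \frac{\beta_{xy} \Delta R}{\langle \beta^2 \rangle} \Big\} } . \qquad (169)$$

Note that this paper has exposed a close analogy between noiseless and noisy
coding, as far as $p_{err}$ is concerned. For example, Eq. (70) for noiseless coding is
analogous to Eq. (159) for noisy coding. Likewise, Eq. (91) is analogous to Eq. (164).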

5.3 RG Equations 

In this section, we will calculate the RG equations for channel transmission using 
random encoding and ML decoding. 

For noiseless coding, the RG equations arose from rescaling Eq. (j70|) . In the 
case we are now considering, that of noisy coding, the RG equations arise from rescal- 
ing Eq. (jl59j) . Note the close resemblance between these two equations. 

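In the noiseless coding case, we found a RG equation for $P_{\underline{x}}$ by assuming
that the argument $nD(P_{\underline{x}}//Q_{\underline{x}})$ of the exponential in the integrand of Eq. (70) was
invariant under a change of scale. In analogy, for noisy coding, we find a RG equation for $P_{\underline{x},\underline{y}}$
by assuming that the argument $nD(P_{\underline{x},\underline{y}}//Q_{\underline{x},\underline{y}})$ of the exponential in the integrand
of Eq. (159) is invariant under a change of scale. We get

$$\frac{\partial \Delta P^{(s)}(x,y)}{\partial s} = -\gamma_0(P^{(s)}, Q)\, \Delta P^{(s)}(x,y) , \qquad (170)$$

where

$$\gamma_0(P, Q) = \frac{D(P//Q)}{D(P//Q) + D(Q//P)} . \qquad (171)$$

In the noiseless coding case, we found a RG equation for $\Delta R$ by assuming
that the theta function in the integrand of Eq. (70) was invariant under a change of
scale. In analogy, for noisy coding, we find a RG equation for $\Delta R$ by assuming that the theta
function in the integrand of Eq. (159) is invariant under a change of scale. We get

$$\frac{\partial \Delta R^{(s)}}{\partial s} = -\gamma_1(P^{(s)}, Q)\, \Delta R^{(s)} , \qquad (172)$$

where

$$\gamma_1(P, Q) = \lim_{s \to 0} \frac{(-1)}{T(P^{(s)}, Q)} \frac{\partial T(P^{(s)}, Q)}{\partial s} , \qquad (173)$$

where

$$T(P, Q) = \tau + t , \qquad (174)$$

where

$$\tau = \sum_{x,y} \Delta P(x,y) L_{xy} . \qquad (175)$$

For any real valued function $f(s)$ of $s \geq 0$, define

$$\mathcal{D} f = \lim_{s \to 0} \frac{(-1)}{\gamma_0} \frac{\partial f}{\partial s} . \qquad (176)$$

Note that $\mathcal{D} P^{(s)} = \Delta P$ and $\gamma_1 = \gamma_0 \frac{\mathcal{D}T}{T}$. Substituting Eq. (174) into Eq. (173) gives

$$\gamma_1(P, Q) = \Big( 1 + \frac{\mathcal{D}t - t}{\tau + t} \Big)\, \gamma_0(P, Q) . \qquad (177)$$

Eq. (156) gives $t$ to lowest order in $\epsilon$. It is easy to show that for such a $t$, $\mathcal{D}t = 2t$, so
$\gamma_1 = (1 + \frac{t}{\tau + t}) \gamma_0$. In Appendix E we find $t$ and $\gamma_1$ to all orders in $\epsilon$.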



5.4 Coda to Error Model 

It is customary [2] to end a discussion of noisy coding with random encoding with
the following 3 observations.

Replace $C_1$ by Capacity. In $C_1$, $Q(x)$ and $Q(y|x)$ are independent. The capacity
is defined by $C = \max_{Q_{\underline{x}} \in pd(S_{\underline{x}})} C_1$. Let $Q^*_{\underline{x}} \in pd(S_{\underline{x}})$ be the probability
distribution $Q_{\underline{x}}$ that maximizes $C_1$ at fixed $Q(y|x)$. The $p_{err}$ that we derived for
random encoding depends on $C_1$. It is advantageous to set $Q_{\underline{x}} = Q^*_{\underline{x}}$ in $p_{err}$
since $p_{err}(C) \leq p_{err}(C_1)$.

Keep Best Codebook. The $p_{err}$ that we derived for random encoding was averaged
over all possible codebooks $\kappa$ (there are $N_{\underline{x}}^{nM}$ of them). There must exist a
"best" codebook $\kappa_{best}$ among these such that $p_{err}(\kappa_{best}) \leq p_{err}(\kappa)$ for all $\kappa$, and
therefore $p_{err}(\kappa_{best}) \leq$ mean of $(p_{err}(\kappa))_{\kappa}$.

Keep Ruly Half of Codebook. Suppose $x_1 \leq x_2 \leq \cdots \leq x_N$ is a monotonically
non-decreasing sequence of non-negative real numbers. Define partial sums $S_{a,b} = x_a + x_{a+1} +
\ldots + x_b$ for $a \leq b$. The mean of the sequence is $\mu = S_{1,N}/N$ and its median is
$x_{N/2}$. It is easy to prove by contradiction that $x_{N/2} \leq 2\mu$.

Define the "unruly half" $S_{\underline{m}}^{unruly}$ of a codebook to be the set of all $m \in S_{\underline{m}}$ for
which $p_{err|m}$ is larger than the median of $(p_{err|m})_{\forall m \in S_{\underline{m}}}$. Thus, $S_{\underline{m}}^{ruly} \cup S_{\underline{m}}^{unruly} =
S_{\underline{m}}$. If we remove the "unruly half" of a codebook, then we end up with a
new codebook with half as big an $M$; symbolically, $M_{ruly} = M_{all}/2$. In the limit
of large codeword size $n$, this does not affect the rate $R$ too much. Indeed,
$R_{ruly} = \frac{1}{n} \ln(\frac{M_{all}}{2}) = R_{all} - \frac{1}{n} \ln(2) \to R_{all}$. The advantage of keeping only the
ruly half of a codebook is that $p_{err|m}$ for all $m \in S_{\underline{m}}^{ruly}$ is bounded above by
$2p_{err}(all)$.
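
The "ruly half" bound is easy to check numerically. The Python sketch below draws hypothetical per-message error probabilities (the skewed distribution `random.random()**4` is an arbitrary assumption chosen only to produce a few badly behaved messages), keeps the better half, and compares its worst member with twice the overall mean.

```python
import random

random.seed(0)
# Hypothetical per-message error probabilities p_{err|m}, sorted ascending.
p_err_m = sorted(random.random() ** 4 for _ in range(1000))

mean_all = sum(p_err_m) / len(p_err_m)      # p_err averaged over the full codebook
ruly = p_err_m[: len(p_err_m) // 2]         # keep the half with the smallest p_{err|m}
worst_ruly = max(ruly)

print(mean_all, worst_ruly, 2 * mean_all)
```

The bound holds for any non-negative sample, because at least half of the entries are no smaller than `worst_ruly`, so the mean is at least `worst_ruly / 2`.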
6 Computer Results



In this section, we will describe the algorithms used by the computer program WimpyRG- 
C1.0 to solve the equations of this paper, and we will give examples of typical inputs 
and outputs of said program. For more information about WimpyRG, see its source 
code and accompanying documentation. 



6.1 Old-Noiseless Approximation of $p_{err}$



First, let us describe how WimpyRG calculates the old fashioned approximation for 
p err , in the case of noiseless coding. 

We shall indicate derivatives by primes. Previously, we defined

$$Z(\lambda) = \sum_x Q(x)^{\frac{1}{1+\lambda}} , \qquad (178)$$

$$\gamma(\lambda) = \lambda R - (1 + \lambda) \ln Z(\lambda) , \qquad (179)$$

$$\gamma = \max_{\lambda} \gamma(\lambda) , \qquad (180)$$

and we showed that the probability of error is approximated by

$$p_{err} \approx e^{-n\gamma} . \qquad (181)$$

To maximize the function $\gamma(\lambda)$, WimpyRG uses the simple Newton-Raphson
(NR) method as follows. Note that only the range $R \in (0, \ln N_{\underline{x}})$ is of interest. It is
easy to show that for all $\lambda > 0$, if $R \in (H(Q), \ln N_{\underline{x}})$, then $\gamma(\lambda)$ has a negative second
derivative and $\gamma'(0) = \Delta R > 0$. Hence, for $R \in (H(Q), \ln N_{\underline{x}})$, $\gamma(\lambda)$ has a unique
maximum at some point $\lambda = \lambda_0 > 0$. The NR method is a way of finding the zeros of a
function $f : Reals \to Reals$. Suppose that $a$ is a point near a zero of $f$. We can Taylor expand
$f(x)$ to first order about $a$: $f(x) \approx f(a) + f'(a)(x - a)$. Thus, $f(x) = 0$ implies
$x \approx a - f(a)/f'(a)$. This suggests the recursion relation: $x_{n+1} = x_n - f(x_n)/f'(x_n)$
for $n = 0, 1, 2, \ldots$. Replacing $x$ by $\lambda$, and $f(x)$ by $\gamma'(\lambda)$, one gets

$$\lambda_{n+1} = \lambda_n - \frac{\gamma'(\lambda_n)}{\gamma''(\lambda_n)} . \qquad (182)$$

WimpyRG uses the previous recursion relation to find the maximum of $\gamma(\lambda)$. This
algorithm requires that we know the functions $\gamma'(\lambda)$ and $\gamma''(\lambda)$. These two derivatives
can be computed explicitly as follows. Define

$$Z_n(\lambda) = \sum_x Q(x)^{\frac{1}{1+\lambda}} [\ln Q(x)]^n . \qquad (183)$$

Note that $Z = Z_0$. It is easy to show that

$$\gamma'(\lambda) = R - \ln Z_0 + \frac{Z_1}{(1 + \lambda) Z_0} , \qquad (184)$$

and

$$\gamma''(\lambda) = \frac{Z_1^2 - Z_0 Z_2}{(1 + \lambda)^3 Z_0^2} . \qquad (185)$$
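
The Newton-Raphson recursion of Eq. (182) can be sketched in Python as follows. This is not WimpyRG's C++ code; it assumes $Z_n(\lambda) = \sum_x Q(x)^{1/(1+\lambda)} [\ln Q(x)]^n$, $\gamma'(\lambda) = R - \ln Z_0 + Z_1/[(1+\lambda)Z_0]$ and $\gamma''(\lambda) = (Z_1^2 - Z_0 Z_2)/[(1+\lambda)^3 Z_0^2]$, and the starting point, tolerance and example rate are illustrative choices.

```python
import math

def Z_n(Q, lam, n):
    a = 1.0 / (1.0 + lam)
    return sum(q ** a * math.log(q) ** n for q in Q)

def gamma(Q, R, lam):
    return lam * R - (1.0 + lam) * math.log(Z_n(Q, lam, 0))

def gamma_prime(Q, R, lam):
    Z0, Z1 = Z_n(Q, lam, 0), Z_n(Q, lam, 1)
    return R - math.log(Z0) + Z1 / ((1.0 + lam) * Z0)

def gamma_double_prime(Q, lam):
    Z0, Z1, Z2 = (Z_n(Q, lam, k) for k in range(3))
    return (Z1 * Z1 - Z0 * Z2) / ((1.0 + lam) ** 3 * Z0 * Z0)

def maximize_gamma(Q, R, lam=0.0, tol=1e-12, max_iter=100):
    """Newton-Raphson iteration lam <- lam - gamma'(lam)/gamma''(lam)."""
    for _ in range(max_iter):
        step = gamma_prime(Q, R, lam) / gamma_double_prime(Q, lam)
        lam -= step
        if abs(step) < tol:
            break
    return lam, gamma(Q, R, lam)

Q = [0.20, 0.30, 0.13, 0.37]
H_Q = -sum(q * math.log(q) for q in Q)
R = H_Q + 0.03                      # hypothetical rate inside (H(Q), ln 4)
lam_0, gamma_max = maximize_gamma(Q, R)
p_err_old = math.exp(-20 * gamma_max)   # old approximation for n = 20
print(lam_0, gamma_max, p_err_old)
```

Starting from $\lambda = 0$ is convenient because $\gamma'(0) = \Delta R$ is known to be positive in the range of interest, so the first step moves towards the maximum.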



6.2 New-Noiseless and New-Noisy Approximations of $p_{err}$



Next, let us describe how WimpyRG calculates the new (CCRG) approximation for 
p err , in the case of either noiseless or noisy coding. 

For both noiseless and noisy coding, we must solve the following pair of coupled
RG equations. For $s \geq 0$,

$$\frac{\partial \Delta R^{(s)}}{\partial s} = -\gamma_1(P^{(s)}, Q)\, \Delta R^{(s)} , \qquad (186)$$

and

$$\frac{\partial \Delta P^{(s)}(X)}{\partial s} = -\gamma_0(P^{(s)}, Q)\, \Delta P^{(s)}(X) \qquad (187)$$

for all $X \in S_{\underline{X}}$, where $S_{\underline{X}} = S_{\underline{x}}$ for noiseless coding and $S_{\underline{X}} = S_{\underline{x},\underline{y}}$ for noisy coding.
We must solve this pair of RG equations subject to the following pair of boundary
conditions: At $s = 0$:

$$\Delta R^{(0)} = \Delta R , \qquad (188)$$

and at $s = s_{fin}$:

$$P^{(s_{fin})}(X) = Q(X) + B(X)\, \Delta R^{(s_{fin})} , \qquad (189)$$

for all $X$. $\gamma_0$ and $\gamma_1$ are known functions of $P$ and $Q$. $\gamma_0$ is the same for both
noiseless and noisy coding, but $\gamma_1$ is different. $\Delta R$ is assumed to be known. $\Delta R$
equals $R - H(Q)$ for noiseless coding and $R - C_1$ for noisy coding. The test fractions
$\Phi_0(P, Q)$ and $\Phi_1(P, Q)$ are also known functions of $P$ and $Q$. $s_{fin}$ is defined as any $s$
large enough for the following to be true: $\Phi_0(P^{(s_{fin})}, Q) \ll 1$ and $\Phi_1(P^{(s_{fin})}, Q) \ll 1$.
$B(X)$ is also a known function. It depends on $Q$ but not $P$, and it differs for noiseless
and noisy coding.

Eqs. (186) and (187) can be solved recursively by performing the following
steps:

(1) Move Backwards (from $s = s_{fin}$ to $s = 0$) This step will be performed either
at the beginning of the algorithm, or after performing step (2) below. If this
step is being performed after step (2), then step (2) has just yielded a fresh
value of $\Delta R^{(s_{fin})}$. On the other hand, if this step is being performed at the
beginning of the algorithm, take $\Delta R^{(s_{fin})} = 10^{-12}$.
Substituting $\Delta R^{(s_{fin})}$ into Eq. (189) gives $\Delta P^{(s_{fin})}$. Hence we can solve Eq. (187)
numerically (using the Fourth Order Runge Kutta algorithm [7]) to get $\Delta P^{(s)}(x)$
for all $x \in S_{\underline{X}}$ and all $s \in [0, s_{fin}]$. These $\Delta P^{(s)}(x)$ values can in turn be used to
calculate $\gamma_1(P^{(s)}, Q)$ for each $s \in [0, s_{fin}]$.

(2) Move Forwards (from $s = 0$ to $s = s_{fin}$) After step (1), we have a fresh value
of $\gamma_1(P^{(s)}, Q)$ for each $s \in [0, s_{fin}]$. By virtue of Eq. (188), $\Delta R^{(0)}$ is also known.
Hence we can solve Eq. (186) numerically (again, using the Fourth Order Runge
Kutta algorithm) to get $\Delta R^{(s_{fin})}$.

One performs steps (1), (2), (1), (2), $\ldots$ until the difference between two
successive values of $\Delta R^{(s_{fin})}$ is very small.
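
The backward/forward cycle can be sketched in Python for the noiseless case. This is a rough illustration and not WimpyRG's implementation, and it departs from the description above in three labeled ways: the starting guess for $\Delta R^{(s_{fin})}$ is $\Delta R\, e^{-s_{fin}/2}$ rather than $10^{-12}$ (to avoid round-off in the relative entropies), the forward step integrates the linear equation for $\Delta R^{(s)}$ with the trapezoid rule instead of Runge-Kutta, and $\gamma_1$ is assumed to be $\gamma_0[1 - D(Q//P)/\Delta H(P)]$. The step count and tolerance are arbitrary.

```python
import math

def entropy(P):
    return -sum(p * math.log(p) for p in P)

def rel_entropy(P, Q):
    return sum(p * math.log(p / q) for p, q in zip(P, Q))

def gamma0(P, Q):
    d_pq, d_qp = rel_entropy(P, Q), rel_entropy(Q, P)
    return d_pq / (d_pq + d_qp)

def gamma1(P, Q):   # assumed noiseless form, see lead-in
    return gamma0(P, Q) * (1.0 - rel_entropy(Q, P) / (entropy(P) - entropy(Q)))

def rk4_step(dP, Q, h):
    """One RK4 step (h < 0 moves backwards in s) for d(dP)/ds = -gamma0 * dP."""
    def f(v):
        g = gamma0([q + d for q, d in zip(Q, v)], Q)
        return [-g * d for d in v]
    k1 = f(dP)
    k2 = f([d + 0.5 * h * k for d, k in zip(dP, k1)])
    k3 = f([d + 0.5 * h * k for d, k in zip(dP, k2)])
    k4 = f([d + h * k for d, k in zip(dP, k3)])
    return [d + h * (a + 2 * b + 2 * c + e) / 6.0
            for d, a, b, c, e in zip(dP, k1, k2, k3, k4)]

def solve_rg(Q, delta_R, s_fin=7.5, steps=300, max_cycles=100, rel_tol=1e-3):
    H_Q = entropy(Q)
    beta = [-(math.log(q) + H_Q) for q in Q]
    beta2 = sum(q * b * b for q, b in zip(Q, beta))
    B = [b * q / beta2 for q, b in zip(Q, beta)]
    h = s_fin / steps
    dR_fin = delta_R * math.exp(-0.5 * s_fin)        # starting guess (assumption)
    for cycle in range(1, max_cycles + 1):
        # Step (1): move backwards from s_fin to 0, recording gamma1 along the way.
        dP = [b * dR_fin for b in B]
        g1 = [gamma1([q + d for q, d in zip(Q, dP)], Q)]
        for _ in range(steps):
            dP = rk4_step(dP, Q, -h)
            g1.append(gamma1([q + d for q, d in zip(Q, dP)], Q))
        # Step (2): move forwards; dR(s_fin) = dR(0) * exp(-integral of gamma1).
        integral = h * (sum(g1) - 0.5 * (g1[0] + g1[-1]))
        new_dR_fin = delta_R * math.exp(-integral)
        converged = abs(new_dR_fin - dR_fin) <= rel_tol * abs(new_dR_fin)
        dR_fin = new_dR_fin
        if converged:
            break
    P0 = [q + d for q, d in zip(Q, dP)]              # dominant point at s = 0
    return dR_fin, P0, cycle

Q = [0.20, 0.30, 0.13, 0.37]
dR_fin, P0, cycles = solve_rg(Q, -0.05)              # hypothetical Delta R
print(dR_fin, P0, cycles)
```

A useful consistency check on such a sketch is that the entropy of the returned `P0` differs from $H(Q)$ by approximately the requested $\Delta R$, since the dominant point should sit on the boundary of the theta-function constraint.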
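Let

$$\mathcal{E} = \frac{1}{2}\, \mathrm{erfc}\left( \Delta R^{(s_{fin})} \sqrt{\frac{n^{(s_{fin})}}{2 \langle \beta^2 \rangle}} \right) , \qquad (190)$$

where $n^{(s_{fin})} = e^{s_{fin}} n$. The probability of error $p_{err}$ is approximately equal to $\mathcal{E}$ for
noiseless coding and to $1 - \mathcal{E}$ for noisy coding. However, the quantities $\Delta R^{(s_{fin})}$ and
$\langle \beta^2 \rangle$ that appear in $\mathcal{E}$ have different definitions for noiseless and noisy coding.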



6.3 Examples of WimpyRG Input and Output

Figure 7: A plot of WimpyRG output for noiseless coding (horizontal axis: $R - H(Q)$).

Fig. 7 is a plot of WimpyRG output for noiseless coding. It gives $p_{err}$ as a
function of $R - H(Q)$, for $n = 20$ and $Q = (.20, .30, .13, .37)$. $H(Q) = 1.316$. The
maximum possible $R$ is $\ln(N_{\underline{x}}) = 1.386$. Curve Old, the old approximation of $p_{err}$,
is a plot of Eq. (181). Let $\mathcal{E}$ be given by Eq. (190). Curve Unren, the unrenormalized
approximation of $p_{err}$, is a plot of $\mathcal{E}$ with $s_{fin} = 0$ (hence $n^{(s_{fin})} = e^{s_{fin}} n = 20$). Curve
Ren, the renormalized approximation of $p_{err}$, is a plot of $\mathcal{E}$ with $s_{fin} = 7.5$ (hence
$n^{(s_{fin})} = e^{s_{fin}} n = 36160.8$).

It appears from Fig. 7 that curve Unren is always higher than or equal to curve
Ren. As expected, both the Old and Ren curves plummet towards $p_{err} = 0$ at
$R = \ln N_{\underline{x}}$.

Curve Old is not expected to be a good approximation for $p_{err}$ when $R$ is close
to $H(Q)$. Indeed, for $R = H(Q)$, $\gamma = 0$, so $e^{-n\gamma}$ is indeterminate because $n\gamma = \infty \cdot 0$.
On the other hand, curve Ren is expected to behave best when $R$ is near $H(Q)$,
in the sense that the closer $R$ is to $H(Q)$, the lower the value of $s_{fin}$ that is required
to reach the quadratic regime.

While generating the points $(\Delta R, p_{err})$ plotted in Fig. 7, WimpyRG also generated
certain figures of merit for each point. For example, when generating the point
$(\Delta R, p_{err}) = (-0.15825, 0.925769)$, WimpyRG also generated:

number of cycles (max is 100) = 6
test fraction 0 (initial, final) = 0.15137, 0.00234084
test fraction 1 (initial, final) = 0.397863, 0.0105788
n (initial, final) = 20, 36160.8
Delta R (initial, final) = -0.15825, -0.00271302
R, unrenormalized error_prob, error_prob = 1.15793, 0.976272, 0.925769

In this output, "initial" always refers to $s = 0$ and "final" to $s = s_{fin} = 7.5$. A "cycle"
is defined as a single application of the Backward/Forward steps defined previously. A
cycle takes the computer program from $s = s_{fin}$ to $s = 0$ and back again. The "number
of cycles" is how many cycles were required before reaching a reasonably constant (i.e.
varying no more than 0.1% between successive cycles) value for $\Delta R^{(s_{fin})}$. Notice that
test fractions $\Phi_0$ and $\Phi_1$ decreased substantially whereas $n$ increased substantially in
going from $s = 0$ to $s = s_{fin}$. Hurray!
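
The two error probabilities in the sample output above can be cross-checked by hand. The Python sketch below assumes that Eq. (190) reads $\frac{1}{2}\mathrm{erfc}\big(\Delta R^{(s)} \sqrt{n^{(s)}/(2\langle\beta^2\rangle)}\big)$ with $\langle\beta^2\rangle = \sum_x Q(x)[\ln Q(x) + H(Q)]^2$ for noiseless coding; the function name `error_function_estimate` is an illustrative choice. The initial and final values of $n$ and $\Delta R$ are copied from the output.

```python
import math

Q = [0.20, 0.30, 0.13, 0.37]
H_Q = -sum(q * math.log(q) for q in Q)
beta2 = sum(q * (math.log(q) + H_Q) ** 2 for q in Q)    # <beta^2> for noiseless coding

def error_function_estimate(delta_R, n):
    """(1/2) erfc(delta_R * sqrt(n / (2 <beta^2>)))."""
    return 0.5 * math.erfc(delta_R * math.sqrt(n / (2.0 * beta2)))

unren = error_function_estimate(-0.15825, 20.0)           # "initial" values, s = 0
ren = error_function_estimate(-0.00271302, 36160.8)       # "final" values, s = 7.5
print(H_Q, unren, ren)
```

Under these assumptions the two printed probabilities agree with the `unrenormalized error_prob` and `error_prob` entries of the output to the quoted precision.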

Fig. 8 is a plot of WimpyRG output for noisy coding. It gives $p_{err}$ as a function
of $R - C$, for $n = 20$. The channel probability $Q(y|x)$ for these plots is $Q(0|0) =
Q(1|1) = 0.99$, $Q(1|0) = Q(0|1) = 0.01$ (a binary symmetric channel). The source
distribution $Q(x)$ is $Q(0) = Q(1) = 0.5$, as required to make $C_1 = C$ for a binary
symmetric channel. For this $Q(y|x)$ and $Q(x)$, $C = 0.637$ nats (or $C = .919$ bits if one
uses base 2 logs). Let $\mathcal{E}$ be given by Eq. (190). Curve Unren, the unrenormalized
approximation of $p_{err}$, is a plot of $1 - \mathcal{E}$ with $s_{fin} = 0$ (hence $n^{(s_{fin})} = e^{s_{fin}} n = 20$).
Curves Ren2, Ren3 and Ren4, renormalized approximations of $p_{err}$, are plots
of $1 - \mathcal{E}$ with $s_{fin} = 7.5$ (hence $n^{(s_{fin})} = e^{s_{fin}} n = 36160.8$). To obtain curve Ren $j$ for
$j \in \{2, 3, 4\}$, we used an approximation for $t$ that included terms up to and including
order $\epsilon^j$. See Appendix E.

Figure 8: A plot of WimpyRG output for noisy coding.
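
The capacity quoted for this binary symmetric channel can be reproduced from the definition $C_1 = \sum_{x,y} Q(x,y) \ln[Q(x,y)/(Q(x)Q(y))]$. The Python sketch below is only a check of that arithmetic; the function name `mutual_information` and the table layout `Q_y_given_x[x][y]` are illustrative assumptions.

```python
import math

def mutual_information(Q_x, Q_y_given_x):
    """C_1 in nats for source Q_x and channel table Q_y_given_x[x][y]."""
    n_y = len(Q_y_given_x[0])
    Q_y = [sum(Q_x[x] * Q_y_given_x[x][y] for x in range(len(Q_x)))
           for y in range(n_y)]
    total = 0.0
    for x, q_x in enumerate(Q_x):
        for y in range(n_y):
            q_xy = q_x * Q_y_given_x[x][y]
            if q_xy > 0.0:
                total += q_xy * math.log(q_xy / (q_x * Q_y[y]))
    return total

channel = [[0.99, 0.01], [0.01, 0.99]]      # Q(y|x) used for Fig. 8
C1_nats = mutual_information([0.5, 0.5], channel)
C1_bits = C1_nats / math.log(2.0)
print(C1_nats, C1_bits)
```

With the uniform source the result equals the channel capacity, which is why $C_1 = C$ for these plots.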

Fig. 9 is a magnified view of a part of Fig. 8, the part with the smallest values
of $\Delta R$. Each renormalized curve Ren $j$ for $j \in \{2, 3, 4\}$ has endpoints $a_j$ and $b_j$
such that the curve is shown only for $\Delta R \in [a_j, b_j]$. We found that our algorithm for
obtaining Ren $j$ broke down for $\Delta R < a_j$ and $\Delta R > b_j$. There is no guarantee that
the Runge Kutta algorithm that we use for solving the RG equations will not produce
unphysical values such as a $P^{(s)}(X) \notin [0, 1]$ or a $\gamma_1 < 0$ at some intermediate step.
Such unphysical values for $P^{(s)}(X)$ or $\gamma_1$ were obtained by WimpyRG for $\Delta R < a_j$
or $\Delta R > b_j$ but not for $a_j \leq \Delta R \leq b_j$. We conjecture that a curve Ren $\infty$ that used
$t$ to all orders in $\epsilon$ would reach $p_{err} = 0$ and $p_{err} = 1$ at finite values of $\Delta R$.

Figure 9: Magnified view of part of Fig. 8, the part with the smallest $\Delta R$ values.



A Appendix: Error Function 

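This appendix reviews well known properties of the Error Function [7].
The Error Function is defined for real $x$ by

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x d\xi\; e^{-\xi^2} . \qquad (191)$$

erf($x$) can be analytically continued to complex $x$, but we have no need to consider
such an extension in this paper. The complement of the Error Function is defined by

$$\mathrm{erfc}(x) = 1 - \mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} d\xi\; e^{-\xi^2} . \qquad (192)$$

See Fig. 10 for a plot of erf($x$) and erfc($x$). Under reflection $x \to -x$, erf($x$) obeys

$$\mathrm{erf}(-x) = -\mathrm{erf}(x) , \qquad (193)$$

and erfc() obeys

$$\mathrm{erfc}(-x) = 1 - \mathrm{erf}(-x) = 2 - \mathrm{erfc}(x) . \qquad (194)$$

For real $x$ such that $|x| \ll 1$,

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \Big( x - \frac{x^3}{1!\,3} + \frac{x^5}{2!\,5} - \frac{x^7}{3!\,7} + \ldots \Big) . \qquad (195)$$

For real $x$ such that $|x| \gg 1$,

$$\mathrm{erfc}(x) = 2\theta(x < 0) + \frac{e^{-x^2}}{x \sqrt{\pi}} \Big( 1 - \frac{1}{2x^2} + \frac{1 \cdot 3}{(2x^2)^2} - \frac{1 \cdot 3 \cdot 5}{(2x^2)^3} + \ldots \Big) . \qquad (196)$$

Claim A.1 For $a, b, \Lambda \in Reals$ with $\Lambda, a > 0$,

$$\mathrm{erfc}\Big( \frac{b}{2\sqrt{a}} \Big) = \frac{1}{\pi i} \int_{\Lambda - i\infty}^{\Lambda + i\infty} \frac{d\lambda}{\lambda}\; \exp(a\lambda^2 - b\lambda) . \qquad (197)$$

proof:

$$\mathrm{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_{-\infty}^{\infty} d\xi\; e^{-\xi^2}\, \theta(\xi > x) \qquad (198a)$$

$$= \frac{2}{\sqrt{\pi}} \frac{1}{2\pi i} \int_{\Lambda - i\infty}^{\Lambda + i\infty} \frac{d\lambda}{\lambda} \int_{-\infty}^{\infty} d\xi\; \exp(-\xi^2 + \lambda\xi - \lambda x) \qquad (198b)$$

$$= \frac{1}{\pi i} \int_{\Lambda - i\infty}^{\Lambda + i\infty} \frac{d\lambda}{\lambda}\; \exp\Big( \frac{\lambda^2}{4} - \lambda x \Big) . \qquad (198c)$$

In Eq. (198), we went from line (a) to (b) by using the integral representation of the
theta function, as given by Eq. (20). Now make the replacements $\lambda \to 2\sqrt{a}\,\lambda$, $x \to \frac{b}{2\sqrt{a}}$
in Eq. (198c). QED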
B Appendix: Taylor Expansions Related to Information Theory

This handy appendix collects in one place several Taylor expansions that arise
frequently in Information Theory.

For real $x$ such that $|x| < 1$,

$$\ln(1 + x) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1} x^n}{n} \qquad (199a)$$

$$= x - \frac{x^2}{2} + \frac{x^3}{3} + \ldots . \qquad (199b)$$

Thus, for $|\frac{h}{x}| < 1$,

$$\ln(x + h) = \ln\Big[ x \Big( 1 + \frac{h}{x} \Big) \Big] = \ln x + \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n} \Big( \frac{h}{x} \Big)^n \qquad (200a)$$

$$= \ln x + \frac{h}{x} - \frac{h^2}{2x^2} + \ldots , \qquad (200b)$$

$$(x + h) \ln(x + h) = x \ln x + h(\ln x + 1) + h \sum_{n=2}^{\infty} \frac{(-1)^n}{n(n-1)} \Big( \frac{h}{x} \Big)^{n-1} \qquad (201a)$$

$$= x \ln x + h(\ln x + 1) + \frac{h^2}{2x} + \ldots . \qquad (201b)$$

Let $\Delta P(x) = P(x) - Q(x)$. Then

$$H(P) = -\sum_x P(x) \ln P(x) \qquad (202a)$$

$$= H(Q) - \sum_x \Delta P(x) \ln Q(x) - \sum_x \frac{[\Delta P(x)]^2}{2Q(x)} + O((\Delta P)^3) , \qquad (202b)$$

and

$$D(P//Q) = \sum_x P(x) \ln \frac{P(x)}{Q(x)} \qquad (203a)$$

$$= \sum_x \frac{[\Delta P(x)]^2}{2Q(x)} + O((\Delta P)^3) . \qquad (203b)$$
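
The second-order expansions of $H(P)$ and $D(P//Q)$ about $Q$ can be checked numerically. The Python sketch below compares the exact values with $H(Q) - \sum_x \Delta P(x) \ln Q(x) - \sum_x [\Delta P(x)]^2/(2Q(x))$ and $\sum_x [\Delta P(x)]^2/(2Q(x))$ respectively; the distribution and the perturbation `direction` are arbitrary illustrative choices.

```python
import math

def entropy(P):
    return -sum(p * math.log(p) for p in P)

def rel_entropy(P, Q):
    return sum(p * math.log(p / q) for p, q in zip(P, Q))

Q = [0.20, 0.30, 0.13, 0.37]
direction = [0.02, -0.01, 0.03, -0.04]      # hypothetical perturbation, sums to zero

def truncation_errors(eps):
    """Absolute errors of the quadratic approximations for Delta P = eps * direction."""
    dP = [eps * d for d in direction]
    P = [q + d for q, d in zip(Q, dP)]
    quad = sum(d * d / (2.0 * q) for d, q in zip(dP, Q))
    H_approx = entropy(Q) - sum(d * math.log(q) for d, q in zip(dP, Q)) - quad
    return abs(entropy(P) - H_approx), abs(rel_entropy(P, Q) - quad)

errors = {eps: truncation_errors(eps) for eps in (0.1, 0.01)}
print(errors)
```

Shrinking `eps` by a factor of 10 should shrink both errors by roughly a factor of 1000, as expected for a remainder of order $(\Delta P)^3$.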




C Appendix: Gaussian Integration Formulae 

In this appendix, we present certain integration formulae that contain a Gaussian 
times a delta or a theta function in the integrand. 

The following lemma will be used to prove Claim C.1, which is the main result
of this appendix.

Lemma C.1 Suppose $A \in Reals^{n \times n}$ is invertible, $v \in Reals^{n \times 1}$, $v^T A^{-1} v \neq 0$, $0 <
\epsilon \ll 1$, and

$$B = A + \frac{v v^T}{\epsilon} . \qquad (204)$$

Then the inverse and determinant of $B$ are given by

$$B^{-1} = A^{-1} - A^{-1} \Lambda A^{-1} \quad \text{where} \quad \Lambda = \frac{v v^T}{v^T A^{-1} v} , \qquad (205)$$

and

$$\det B = \det(A)\, \frac{v^T A^{-1} v}{\epsilon} . \qquad (206)$$

proof:

It is easy to show that if $u$ and $v$ are $n$ dimensional column vectors and

$$B = A + u v^T , \qquad (207)$$

then

$$B^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u} \qquad (208)$$

satisfies $B B^{-1} = B^{-1} B = 1$. Setting $u = v/\epsilon$ and taking the limit $\epsilon \to 0$ yields
Eq. (205).

To show Eq. (206), recall that

$$\ln(\det A) = \mathrm{tr}(\ln A) . \qquad (209)$$

(This well known identity is obvious when $A$ is diagonal. The proof is also very
simple when $A$ is non-diagonal but diagonalizable.) If the entries of $A$ are taken to
be independent variables, then Eq. (209) implies

$$\delta \ln(\det A) = \mathrm{tr}(A^{-1} \delta A) = \sum_{i,j} (A^{-1})_{ij}\, \delta A_{ji} . \qquad (210)$$

Therefore,

$$(A^{-1})_{ij} = \frac{1}{\det A} \frac{\partial (\det A)}{\partial A_{ji}} . \qquad (211)$$

This is just the usual expansion of $A^{-1}$ in terms of cofactors. For definiteness, suppose
$A$ is a $3 \times 3$ matrix with columns $\vec{a}_1, \vec{a}_2, \vec{a}_3$. Suppose $u$ and $v$ are also $3 \times 1$ column
vectors. Then

$$\det(A + u v^T) = \det[\vec{a}_1 + v_1 u,\ \vec{a}_2 + v_2 u,\ \vec{a}_3 + v_3 u] \qquad (212a)$$

$$= \det A + \det[v_1 u, \vec{a}_2, \vec{a}_3] + \det[\vec{a}_1, v_2 u, \vec{a}_3] + \det[\vec{a}_1, \vec{a}_2, v_3 u] \qquad (212b)$$

$$= \det(A) + \sum_{i,j} u_i \frac{\partial \det A}{\partial A_{ij}} v_j \qquad (212c)$$

$$= \det(A)(1 + v^T A^{-1} u) . \qquad (212d)$$

In Eq. (212), we went from line (a) to (b) by using the fact that determinants are
linear functions of each column. We also used the fact that determinants with a pair
of proportional columns are zero, so that, for example,

$$\det[v_1 u, v_2 u, \vec{a}_3] = 0 . \qquad (213)$$

Now setting $u = v/\epsilon$ in Eq. (212) yields

$$\det(B) = \det(A) \Big( 1 + \frac{v^T A^{-1} v}{\epsilon} \Big) \qquad (214a)$$

$$\approx \det(A)\, \frac{v^T A^{-1} v}{\epsilon} . \qquad (214b)$$

QED
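
The rank-one update identities used in this proof, $B^{-1} = A^{-1} - A^{-1} u v^T A^{-1}/(1 + v^T A^{-1} u)$ and $\det(A + u v^T) = \det(A)(1 + v^T A^{-1} u)$, can be verified on a small example. The Python sketch below uses hand-written $2 \times 2$ helpers so that it needs no external library; the matrix and vectors are arbitrary illustrative values.

```python
def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]

def matmul2(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[2.0, 1.0], [1.0, 3.0]]
u = [1.0, 2.0]
v = [1.0, 1.0]

B = [[A[i][j] + u[i] * v[j] for j in range(2)] for i in range(2)]   # B = A + u v^T
A_inv = inv2(A)
A_inv_u = [sum(A_inv[i][k] * u[k] for k in range(2)) for i in range(2)]
vT_A_inv = [sum(v[k] * A_inv[k][j] for k in range(2)) for j in range(2)]
denom = 1.0 + sum(v[i] * A_inv_u[i] for i in range(2))             # 1 + v^T A^{-1} u

# Rank-one update formula for the inverse.
B_inv = [[A_inv[i][j] - A_inv_u[i] * vT_A_inv[j] / denom for j in range(2)]
         for i in range(2)]
identity_check = matmul2(B, B_inv)

det_lhs = det2(B)                  # det(A + u v^T)
det_rhs = det2(A) * denom          # det(A) (1 + v^T A^{-1} u)
print(identity_check, det_lhs, det_rhs)
```

Replacing `u` by `v / eps` with a small `eps` in this sketch reproduces the limiting forms stated in the lemma.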

Claim C.l For x, b e Reals Nxl and A G Reals NxN , define a measure dG(x) so that 
for any real valued function f(x), 



j dG{x) f(x) = II { [ +0 ° dx 3 \ exp (j^Al + b T x \ f{x) . (215) 



Suppose A is a real, positive definite, symmetric matrix. Suppose u, v 6 Reals Nxl 
and a G Reals. Define 

T 

~ 1111 ~ 

A=— -— - , B- 1 = A~ 1 -A- 1 AA- 1 . (216) 



Then 



N 
/ dG(x) 5(v T x) = [[ dG(x) 1] = exp



(217b) 



J dG{x) d(u T x -a>0) = [J dG{x) l]-erfc 



(217c) 



dG(a;) ^ J # J i-a> 0) = [ / dG{x) 5{v l x)]-erfc 



a-u T B~ l b 
V2u T B- 1 u 



(217d) 



proof of Eq.(|2T7aD : 

Since A is symmetric, it can be diagonalized. By diagonalizing A, one can 
convert / dG(x)l into a product of one dimensional Gaussian integrals, 
proof of Ea. (pT7E|) : 

For < e << 1, 



8(v T x) 



V2 



1 f-(v T x) 2 ' 
exp 



7T6 



2e 



Define S by 



£ = A + 



T 
V V 



Then 



dG(x) 5(v T x) 



n 



+oo 



1 (2 



7T 



v 7 ^ Vdet 5 



exp 



(218) 



(219) 



-x T Bx 



exp - 

J VieZi,iv \ ^ 



+ 6 T x (220a) 



(220b) 



Now use the values for B 1 and det B calculated in Lemma IC.ll 
proof of Eq.flH7cD = 



/ dG(x) 9(u T x - a > 0) = / dG(x)— [ 
J J 2ni J a 



A+ioo d\ 



A— ioo A 

— / — dx N exp[ + (b + Xuyx - Act 

Z7TZ JA-ioo A J V 2 J 

N 

1 r A +™d\ (2tt)^ /(& + Am) t A~ 1 (6 + Am) x n 

/ — exp Act 

27ri Ja-joo A Vdet A V 2 



(221a) 
(221b) 

(221c) 
) + \{u T A- l b - a)



(221d) 



(221e) 



In Eq. (j221|) . line (a), we used the integral representation of the theta function given 
by Eq.flHD- In Ea. (12211) . we went from line (b) to (c) by applying Eq. (j2T7aD . We 
went from line (d) to (e) by applying Eq. (|197j) . 
proof of Eq. (217d):

This proof is similar to that of Eqs. (j22(J|) (a), (b) and (c) so it is left to the 
reader. QED 
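As a sanity check on the Gaussian integrals of Claim C.1, the following sketch evaluates $\int dG(x)\,1$ and $\int dG(x)\,\theta(u^T x - a > 0)$ on a grid for a hypothetical 2x2 example and compares them with closed forms. It uses Python's standard `math.erfc`, for which the tail fraction carries an explicit factor 1/2; the paper's erfc is fixed by Eq. (197), which is not reproduced in this appendix, so its normalization may differ. The matrix, vectors, threshold and grid parameters are assumptions chosen only for illustration.

```python
import math

# Hypothetical test data (not from the paper)
A = [[2.0, 0.5], [0.5, 1.0]]   # symmetric, positive definite
b = [0.3, -0.2]
u = [1.0, 1.0]
a = 0.5

det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
Ainv = [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]
mean = [Ainv[0][0] * b[0] + Ainv[0][1] * b[1],
        Ainv[1][0] * b[0] + Ainv[1][1] * b[1]]        # A^{-1} b
mu = u[0] * mean[0] + u[1] * mean[1]                   # u^T A^{-1} b
var = sum(u[i] * Ainv[i][j] * u[j] for i in range(2) for j in range(2))

def weight(x0, x1):
    # exp(-x^T A x / 2 + b^T x), the density of the measure dG
    quad = A[0][0] * x0 * x0 + 2.0 * A[0][1] * x0 * x1 + A[1][1] * x1 * x1
    return math.exp(-0.5 * quad + b[0] * x0 + b[1] * x1)

# Midpoint rule on a square centred on the Gaussian mean
half, n = 6.0, 600
h = 2.0 * half / n
total = 0.0
tail = 0.0
for i in range(n):
    x0 = mean[0] - half + (i + 0.5) * h
    for j in range(n):
        x1 = mean[1] - half + (j + 0.5) * h
        w = weight(x0, x1)
        total += w
        if u[0] * x0 + u[1] * x1 > a:
            tail += w
total *= h * h
tail *= h * h

# Closed forms: (2 pi)^{N/2} / sqrt(det A) * exp(b^T A^{-1} b / 2) with N = 2,
# and the Gaussian tail fraction written with the standard erfc.
closed_total = 2.0 * math.pi / math.sqrt(det) * math.exp(
    0.5 * (b[0] * mean[0] + b[1] * mean[1]))
closed_ratio = 0.5 * math.erfc((a - mu) / math.sqrt(2.0 * var))
```

The grid value of `tail / total` agrees with `closed_ratio` up to the discretization error of the step function, which shrinks linearly with the grid spacing `h`.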

D Appendix: An Integral Over All 
Joint Probability Distributions 
with a Fixed Marginal 

In this appendix, we will show how to convert (1) to (2) where (1) is an integral
over all joint probability distributions $P_{x,y}$ with the same marginal $P_y$, and (2) is an
integral over all conditional probability distributions $P_{x|y}$.

Claim D.l 



Let RHS (ditto, LHS) stand for the right (ditto, left) hand side of Eq. (222).
Suppose $0 \in S_x$. Then



/ 




(222) 



proof: 



LHS 



Jl[{dP(x,y)} 



J] <e[0 < £ P(x, y) < Q(y)} 9(P^ y > 0)/(P) (223a) 




n{[^(?/)]^ l } v Jn{^(-b)w. 
n 0(0 < £ < 1) 9(P^ y > O)f(P) (223b)



x: x^O 



V.y 



= RHS. (223c)

QED 

E Appendix: Perturbation Expansion of t 

In Eq. (156), we gave $t$ to lowest order in $\Delta P$. In this appendix, we show how to
calculate $t$ exactly, as a Taylor series in powers of $\Delta P$.

The point $P^*$ that dominates the integral Eq. (147) is an extremum of the
Lagrangian Eq. (148). In Section 5.2, we approximated the Lagrangian Eq. (148) by
its quadratic approximation Eq. (149). This gave us the dominant point $P^*$ only to
lowest order in $\Delta P$. This time we will use the exact Lagrangian and get the exact
dominant point. Let us re-state the exact Lagrangian:



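$$\mathcal{L} = D(\tilde{P}_{x,y}//Q_{x,y}) - \lambda\left(\sum_{x,y} (P - \tilde{P})(x,y)\, L_{xy}\right) + \sum_y \mu_y\, (P - \tilde{P})(y) . \tag{224}$$

Minimizing this Lagrangian with respect to $\tilde{P}$, $\lambda$ and $\mu_y$ gives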

$$P^*(x|y) = \frac{Q(x|y) \exp(-\lambda L_{xy})}{Z_y(\lambda)} , \tag{225}$$

where

$$Z_y(\lambda) = \sum_x Q(x|y) \exp(-\lambda L_{xy}) . \tag{226}$$

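The parameter $\lambda$ in Eq. (225) is specified implicitly by the equation:

$$\sum_{x,y} P(x,y) L_{xy} = \sum_{x,y} \frac{Q(x|y) \exp(-\lambda L_{xy})}{Z_y(\lambda)}\, P(y) L_{xy} \tag{227a}$$
$$= -\sum_y P(y) \frac{d \ln Z_y(\lambda)}{d\lambda} . \tag{227b}$$

The previous equation can be rewritten as

$$0 = \epsilon + F(\lambda) , \tag{228}$$

where $\epsilon$ and $F(\lambda)$ are defined by

$$\epsilon = \sum_{x,y} P(y)\, \Delta P(x|y)\, L_{xy} , \tag{229}$$

and

$$F(\lambda) = \sum_y P(y) \left\{ \frac{d \ln Z_y(\lambda)}{d\lambda} - \left[\frac{d \ln Z_y(\lambda)}{d\lambda}\right]_{\lambda = 0} \right\} . \tag{230}$$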
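To make Eqs. (225) to (230) concrete, here is a minimal sketch that builds the tilted distribution of Eq. (225) for a small toy model and solves the constraint Eq. (227) for $\lambda$ by bisection. The alphabets, the arrays `Q`, `P_cond`, `P_y`, `L` and the function names are hypothetical, and bisection is one convenient root finder, not necessarily what the paper or WimpyRG uses.

```python
import math

# Hypothetical toy model: y in {0, 1}, x in {0, 1, 2}
Q = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]            # Q(x|y), one row per y
P_cond = [[0.49, 0.306, 0.204], [0.21, 0.2, 0.59]]  # P(x|y)
P_y = [0.4, 0.6]                                  # P(y)
L = [[0.0, 1.0, 2.5], [1.5, 0.2, 0.7]]            # L_xy, arbitrary reals

def Z(y, lam):
    # Eq. (226)
    return sum(q * math.exp(-lam * l) for q, l in zip(Q[y], L[y]))

def tilted(y, lam):
    # Eq. (225): P*(x|y) = Q(x|y) exp(-lam L_xy) / Z_y(lam)
    z = Z(y, lam)
    return [q * math.exp(-lam * l) / z for q, l in zip(Q[y], L[y])]

def constraint_gap(lam):
    # sum_{x,y} P(y) [P(x|y) - P*(x|y)] L_xy, which equals eps + F(lam)
    gap = 0.0
    for y in range(len(P_y)):
        star = tilted(y, lam)
        gap += P_y[y] * sum((p - s) * l
                            for p, s, l in zip(P_cond[y], star, L[y]))
    return gap

def solve_lambda(lo=-20.0, hi=20.0, iters=200):
    # The gap is nondecreasing in lam (its derivative is a variance),
    # so bisection on a bracketing interval converges.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if constraint_gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

eps = constraint_gap(0.0)      # at lam = 0, P* = Q and F(0) = 0
lam = solve_lambda()
```

In this example `eps` is positive, so the solved `lam` is negative: the tilt has to favour larger $L_{xy}$ to match the constraint.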



Next we will solve Eq. (228) for $\lambda$ by expressing $\lambda$ as a Taylor series in powers
of $\epsilon$. We begin by expressing the RHS of Eq. (226) as a Taylor series in powers of $\lambda$:

$$Z_y(\lambda) = \sum_{k=0}^{\infty} \frac{A_k(y) (-\lambda)^k}{k!} , \tag{231}$$

where

$$A_k(y) = \sum_x Q(x|y) (L_{xy})^k \tag{232}$$

for $k = 0, 1, 2, \ldots$. It follows that

$$\ln Z_y(\lambda) = a_1 \lambda + a_2 \frac{\lambda^2}{2} + a_3 \frac{\lambda^3}{3} + \cdots , \tag{233}$$

where

$$a_1 = -A_1 , \tag{234a}$$
$$a_2 = -A_1^2 + A_2 , \tag{234b}$$
$$a_3 = -A_1^3 + \frac{3}{2} A_1 A_2 - \frac{1}{2} A_3 , \tag{234c}$$
$$a_4 = -A_1^4 + 2 A_2 A_1^2 - \frac{2}{3} A_1 A_3 - \frac{1}{2} A_2^2 + \frac{1}{6} A_4 . \tag{234d}$$

Define

$$\alpha_k = \sum_y P(y)\, a_k(y) \tag{235}$$

for $k = 1, 2, 3, \ldots$. If we express $F(\lambda)$ as a Taylor series in powers of $\lambda$:

$$F(\lambda) = F_1 \lambda + F_2 \lambda^2 + F_3 \lambda^3 + \ldots , \tag{236}$$

then, by virtue of Eqs. (230), (233) and (235),

$$F_k = \alpha_{k+1} \tag{237}$$

for $k = 1, 2, 3, \ldots$. Eq. (228) can be expressed as a Taylor series in powers of $\lambda$:

$$0 = \epsilon + F_1 \lambda + F_2 \lambda^2 + F_3 \lambda^3 + \ldots . \tag{238}$$
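The coefficients $a_1, \ldots, a_4$ of Eq. (234) can be checked against a direct evaluation of $\ln Z_y(\lambda)$. The sketch below does this for a single, hypothetical row $Q(\cdot|y)$ and $L_{\cdot y}$, with the series of Eq. (233) truncated after the $\lambda^4/4$ term; the variable names are illustrative only.

```python
import math

# Hypothetical data for one fixed y
Q_row = [0.5, 0.3, 0.2]     # Q(x|y)
L_row = [0.0, 1.0, 2.5]     # L_xy

def moment(k):
    # A_k(y) of Eq. (232)
    return sum(q * l ** k for q, l in zip(Q_row, L_row))

A1, A2, A3, A4 = (moment(k) for k in (1, 2, 3, 4))

# Eq. (234)
a1 = -A1
a2 = -A1 ** 2 + A2
a3 = -A1 ** 3 + 1.5 * A1 * A2 - 0.5 * A3
a4 = -A1 ** 4 + 2.0 * A2 * A1 ** 2 - (2.0 / 3.0) * A1 * A3 - 0.5 * A2 ** 2 + A4 / 6.0

def lnZ_exact(lam):
    return math.log(sum(q * math.exp(-lam * l) for q, l in zip(Q_row, L_row)))

def lnZ_series(lam):
    # Eq. (233), truncated after the fourth-order term
    return a1 * lam + a2 * lam ** 2 / 2.0 + a3 * lam ** 3 / 3.0 + a4 * lam ** 4 / 4.0

lam = 0.05
series_error = abs(lnZ_exact(lam) - lnZ_series(lam))
```

Note that $a_2$ is the variance of $L_{xy}$ under $Q(\cdot|y)$, so $F_1 = \alpha_2$ is non-negative; this is what makes the first-order solution for $\lambda$ in powers of $\epsilon$ well defined whenever that variance does not vanish.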
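$\lambda$ itself can be expressed as a Taylor series in powers of $\epsilon$:

$$\lambda = \lambda_1 \epsilon + \lambda_2 \epsilon^2 + \lambda_3 \epsilon^3 + \ldots . \tag{239}$$

Substituting Eq. (239) into Eq. (238) yields an equation for each power of $\epsilon$. These
equations for each power of $\epsilon$ imply:

$$\lambda_1 = \frac{-1}{F_1} , \tag{240a}$$
$$\lambda_2 = \frac{-F_2}{F_1^3} , \tag{240b}$$
$$\lambda_3 = \frac{-2 F_2^2 + F_1 F_3}{F_1^5} , \tag{240c}$$
$$\lambda_4 = \frac{-5 F_2^3 + 5 F_1 F_2 F_3 - F_1^2 F_4}{F_1^7} . \tag{240d}$$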
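The series reversion in Eq. (240) is easy to test numerically: substituting the truncated series for $\lambda$ back into Eq. (238) should leave a residual of fifth order in $\epsilon$. The sketch below uses arbitrary, hypothetical values of $F_1, \ldots, F_4$ and treats $F$ as a quartic polynomial.

```python
# Hypothetical Taylor coefficients of F(lambda); F is taken to be a quartic here.
F1, F2, F3, F4 = 0.47, 0.2, -0.1, 0.05

# Eq. (240): coefficients of lambda as a power series in eps
lam1 = -1.0 / F1
lam2 = -F2 / F1 ** 3
lam3 = (-2.0 * F2 ** 2 + F1 * F3) / F1 ** 5
lam4 = (-5.0 * F2 ** 3 + 5.0 * F1 * F2 * F3 - F1 ** 2 * F4) / F1 ** 7

def F(lam):
    return F1 * lam + F2 * lam ** 2 + F3 * lam ** 3 + F4 * lam ** 4

def lam_series(eps, order):
    # Partial sum of Eq. (239) up to the given order (1 to 4)
    coeffs = [lam1, lam2, lam3, lam4][:order]
    return sum(c * eps ** (k + 1) for k, c in enumerate(coeffs))

eps = 0.01
residual_order4 = eps + F(lam_series(eps, 4))   # expected to be O(eps^5)
residual_order1 = eps + F(lam_series(eps, 1))   # expected to be O(eps^2)
```

With these numbers the first-order residual is of order $10^{-4}$, while the fourth-order residual drops by several orders of magnitude, as expected if Eq. (240) solves Eq. (238) order by order.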

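Now that we know $P^*_{x,y}$ explicitly (in terms of Eq. (225), where $\lambda$ is expressed
as a Taylor series in powers of $\epsilon$), we can find explicitly $\mathcal{L}$ given by Eq. (224) evaluated
at $P^*$:

$$\mathcal{L}^* = D(P_y//Q_y) + \sum_{x,y} P^*(x,y) \ln\left(\frac{P^*(x|y)}{Q(x|y)}\right) \tag{241a}$$
$$= D(P_y//Q_y) + \sum_{x,y} P^*(x,y) \ln\left(\frac{\exp(-\lambda L_{xy})}{Z_y(\lambda)}\right) \tag{241b}$$
$$= D(P_y//Q_y) - \lambda \sum_{x,y} P^*(x,y) L_{xy} - \sum_y P(y) \ln Z_y(\lambda) . \tag{241c}$$

At the extremum the constraint gives $\sum_{x,y} P^*(x,y) L_{xy} = \sum_{x,y} P(x,y) L_{xy}$. Using this and
expanding the $\ln Z_y(\lambda)$ in the previous equations in powers of $\lambda$ yields

$$\mathcal{L}^* = D(P_y//Q_y) - \lambda \sum_{x,y} P(x,y) L_{xy} - \left(\lambda \alpha_1 + \lambda^2 \frac{\alpha_2}{2} + \lambda^3 \frac{\alpha_3}{3} + \cdots\right) \tag{242a}$$
$$= D(P_y//Q_y) - \left(\lambda \epsilon + \lambda^2 \frac{\alpha_2}{2} + \lambda^3 \frac{\alpha_3}{3} + \cdots\right) . \tag{242b}$$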



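Expanding $\lambda$ in the previous equation in powers of $\epsilon$ yields

$$\mathcal{L}^* = D(P_y//Q_y) + t , \tag{243}$$

where

$$t = t_1 \epsilon + t_2 \epsilon^2 + t_3 \epsilon^3 + \cdots , \tag{244}$$

and

$$t_1 = 0 , \tag{245a}$$
$$t_2 = \frac{1}{2 \alpha_2} , \tag{245b}$$
$$t_3 = \frac{\alpha_3}{3 \alpha_2^3} , \tag{245c}$$
$$t_4 = \frac{2 \alpha_3^2 - \alpha_4 \alpha_2}{4 \alpha_2^5} . \tag{245d}$$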
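The expansion of $t$ can be compared with an exact evaluation on a toy model. The sketch below solves the constraint for $\lambda$ by bisection, computes $t = \mathcal{L}^* - D(P_y//Q_y)$ from Eq. (241c), and compares it with the series of Eq. (244) through fourth order, with $t_1 = 0$, $t_2 = 1/(2\alpha_2)$, $t_3 = \alpha_3/(3\alpha_2^3)$ and $t_4 = (2\alpha_3^2 - \alpha_4\alpha_2)/(4\alpha_2^5)$. All data and names are hypothetical, and the bisection solver is an assumption made for illustration.

```python
import math

# Hypothetical toy model: y in {0, 1}, x in {0, 1, 2}
Q = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]              # Q(x|y)
P_cond = [[0.49, 0.306, 0.204], [0.21, 0.2, 0.59]]  # P(x|y), close to Q
P_y = [0.4, 0.6]
L = [[0.0, 1.0, 2.5], [1.5, 0.2, 0.7]]
ys = range(len(P_y))

def lnZ(y, lam):
    return math.log(sum(q * math.exp(-lam * l) for q, l in zip(Q[y], L[y])))

def mean_L_tilted(y, lam):
    # Mean of L_xy under the tilted distribution P*(x|y)
    w = [q * math.exp(-lam * l) for q, l in zip(Q[y], L[y])]
    return sum(wi * l for wi, l in zip(w, L[y])) / sum(w)

target = sum(P_y[y] * p * l for y in ys for p, l in zip(P_cond[y], L[y]))
eps = target - sum(P_y[y] * mean_L_tilted(y, 0.0) for y in ys)

# Solve 0 = eps + F(lam) by bisection (the gap is nondecreasing in lam)
lo, hi = -20.0, 20.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    gap = target - sum(P_y[y] * mean_L_tilted(y, mid) for y in ys)
    if gap < 0.0:
        lo = mid
    else:
        hi = mid
lam = 0.5 * (lo + hi)

# Exact t from Eq. (241c), using the constraint to replace P* by P in the sum
t_exact = -lam * target - sum(P_y[y] * lnZ(y, lam) for y in ys)

def a_coeffs(y):
    # a_1 .. a_4 of Eq. (234)
    A1, A2, A3, A4 = (sum(q * l ** k for q, l in zip(Q[y], L[y]))
                      for k in (1, 2, 3, 4))
    return (-A1,
            -A1 ** 2 + A2,
            -A1 ** 3 + 1.5 * A1 * A2 - 0.5 * A3,
            -A1 ** 4 + 2.0 * A2 * A1 ** 2 - (2.0 / 3.0) * A1 * A3
            - 0.5 * A2 ** 2 + A4 / 6.0)

alpha = [sum(P_y[y] * a_coeffs(y)[k] for y in ys) for k in range(4)]
al2, al3, al4 = alpha[1], alpha[2], alpha[3]

t2 = 1.0 / (2.0 * al2)
t3 = al3 / (3.0 * al2 ** 3)
t4 = (2.0 * al3 ** 2 - al4 * al2) / (4.0 * al2 ** 5)
t_series = t2 * eps ** 2 + t3 * eps ** 3 + t4 * eps ** 4
```

For this example the second-order term alone already captures most of `t_exact`, and adding the third- and fourth-order terms reduces the discrepancy by further orders of magnitude.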



Now that we know $\mathcal{L}^*$ to all orders in $\epsilon$, we can also find $\gamma_1$ to all orders in $\epsilon$.
Recall from Section IB~3l that for any real valued function /(s) of s > 0, 

so that $DP = \Delta P$. When $f$ is the $k$th power of $\epsilon$,

$$D\epsilon^k = k \epsilon^{k-1} \sum_{x,y} [\Delta P(x,y) - Q(x|y) \Delta P(y)] L_{xy} \tag{247a}$$
$$= k \epsilon^k . \tag{247b}$$


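From Eq. (244) one gets

$$Dt = \left\{ t_1 \epsilon + 2 t_2 \epsilon^2 + 3 t_3 \epsilon^3 + \ldots \right\} + \epsilon\, Dt_1 + \epsilon^2 Dt_2 + \epsilon^3 Dt_3 + \ldots . \tag{248}$$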



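One can use Eqs. (245) to calculate $Dt_k$ in terms of $\{\alpha_k\}_{\forall k}$ and $\{D\alpha_k\}_{\forall k}$. For example,
$Dt_2 = -\frac{1}{2\alpha_2^2} D\alpha_2$. By the definition of $\alpha_k$,

$$D\alpha_k = \sum_y \Delta P(y)\, a_k(y) , \tag{249}$$

for $k = 1, 2, 3, \ldots$. Once we know $t$ and $Dt$ to all orders in $\epsilon$, we can use Eq. (177) to
find $\gamma_1$ to all orders in $\epsilon$.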






References 



[1] Nigel Goldenfeld, Lectures on Phase Transitions and the Renormalization Group 
(1992, Perseus Books). 

[2] T.M. Cover, J. A. Thomas, Elements of Information Theory (1991, John Wiley). 

[3] R.E. Blahut, Principles and Practice of Information Theory (1987, Addison-Wesley).

[4] G.F. Carrier, M. Krook, C.E. Pearson, Functions of a Complex Variable (1966,
McGraw-Hill); N. Bleistein, R.A. Handelsman, Asymptotic Expansions of Integrals
(1986, Dover).

[5] R. Fletcher, Practical Methods of Optimization (2000, John Wiley). 

[6] An alternative method of getting a good trial value for AR^ SRn ^ is as follows. Note 
that 7o(-P, Q) and 71 (P, Q) both tend to \ as P — > Q. Thus, a good trial value for 
A_R(»fln) i s e^AR. Plug this value of AR^ into AP(X) = B(X)AR^ and 
check that it gives P ( - Sfin \X) e [0, 1] for all X. If not, then continue halving the 
trial value of AR^ until P (s « n) (X) G [0, 1] for all X. This occurs eventually, 
assuming Q(X) ^ for all X. 

[7] M. Abramowitz, I.A. Stegun, Handbook of Mathematical Functions (1974,
Dover).






